diff --git a/docs/source/ar/llm_tutorial_optimization.md b/docs/source/ar/llm_tutorial_optimization.md index 6697da612c..9ddf0b59c9 100644 --- a/docs/source/ar/llm_tutorial_optimization.md +++ b/docs/source/ar/llm_tutorial_optimization.md @@ -13,7 +13,7 @@ في هذا الدليل، سنستعرض التقنيات الفعالة لتُحسِّن من كفاءة نشر نماذج اللغة الكبيرة: -1. سنتناول تقنية "دقة أقل" التي أثبتت الأبحاث فعاليتها في تحقيق مزايا حسابية دون التأثير بشكل ملحوظ على أداء النموذج عن طريق العمل بدقة رقمية أقل [8 بت و4 بت](/main_classes/quantization.md). +1. سنتناول تقنية "دقة أقل" التي أثبتت الأبحاث فعاليتها في تحقيق مزايا حسابية دون التأثير بشكل ملحوظ على أداء النموذج عن طريق العمل بدقة رقمية أقل [8 بت و4 بت](/main_classes/quantization). 2. **اFlash Attention:** إن Flash Attention وهي نسخة مُعدَّلة من خوارزمية الانتباه التي لا توفر فقط نهجًا أكثر كفاءة في استخدام الذاكرة، ولكنها تحقق أيضًا كفاءة متزايدة بسبب الاستخدام الأمثل لذاكرة GPU. diff --git a/docs/source/en/conversations.md b/docs/source/en/conversations.md index 3c5734b60e..4d416ef8fb 100644 --- a/docs/source/en/conversations.md +++ b/docs/source/en/conversations.md @@ -27,7 +27,7 @@ This guide shows you how to quickly start chatting with Transformers from the co ## chat CLI -After you've [installed Transformers](./installation.md), chat with a model directly from the command line as shown below. It launches an interactive session with a model, with a few base commands listed at the start of the session. +After you've [installed Transformers](./installation), chat with a model directly from the command line as shown below. It launches an interactive session with a model, with a few base commands listed at the start of the session. ```bash transformers chat Qwen/Qwen2.5-0.5B-Instruct @@ -158,4 +158,4 @@ The easiest solution for improving generation speed is to either quantize a mode You can also try techniques like [speculative decoding](./generation_strategies#speculative-decoding), where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token per `forward` pass. This significantly alleviates the bandwidth bottleneck and improves generation speed. > [!TIP] -> Parameters may not be active for every generated token in MoE models such as [Mixtral](./model_doc/mixtral), [Qwen2MoE](./model_doc/qwen2_moe.md), and [DBRX](./model_doc/dbrx). As a result, MoE models generally have much lower memory bandwidth requirements and can be faster than a regular LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because parameters become activated with each new speculated token. +> Parameters may not be active for every generated token in MoE models such as [Mixtral](./model_doc/mixtral), [Qwen2MoE](./model_doc/qwen2_moe), and [DBRX](./model_doc/dbrx). As a result, MoE models generally have much lower memory bandwidth requirements and can be faster than a regular LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because parameters become activated with each new speculated token. diff --git a/docs/source/en/llm_tutorial.md b/docs/source/en/llm_tutorial.md index 68de916998..1f6e166d88 100644 --- a/docs/source/en/llm_tutorial.md +++ b/docs/source/en/llm_tutorial.md @@ -148,9 +148,9 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) | Option name | Type | Simplified description | |---|---|---| | `max_new_tokens` | `int` | Controls the maximum generation length. Be sure to define it, as it usually defaults to a small value. | -| `do_sample` | `bool` | Defines whether generation will sample the next token (`True`), or is greedy instead (`False`). Most use cases should set this flag to `True`. Check [this guide](./generation_strategies.md) for more information. | +| `do_sample` | `bool` | Defines whether generation will sample the next token (`True`), or is greedy instead (`False`). Most use cases should set this flag to `True`. Check [this guide](./generation_strategies) for more information. | | `temperature` | `float` | How unpredictable the next selected token will be. High values (`>0.8`) are good for creative tasks, low values (e.g. `<0.4`) for tasks that require "thinking". Requires `do_sample=True`. | -| `num_beams` | `int` | When set to `>1`, activates the beam search algorithm. Beam search is good on input-grounded tasks. Check [this guide](./generation_strategies.md) for more information. | +| `num_beams` | `int` | When set to `>1`, activates the beam search algorithm. Beam search is good on input-grounded tasks. Check [this guide](./generation_strategies) for more information. | | `repetition_penalty` | `float` | Set it to `>1.0` if you're seeing the model repeat itself often. Larger values apply a larger penalty. | | `eos_token_id` | `list[int]` | The token(s) that will cause generation to stop. The default value is usually good, but you can specify a different token. | diff --git a/docs/source/en/llm_tutorial_optimization.md b/docs/source/en/llm_tutorial_optimization.md index b6df3aa137..5f48adb5b8 100644 --- a/docs/source/en/llm_tutorial_optimization.md +++ b/docs/source/en/llm_tutorial_optimization.md @@ -23,7 +23,7 @@ The crux of these challenges lies in augmenting the computational and memory cap In this guide, we will go over the effective techniques for efficient LLM deployment: -1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization.md) can achieve computational advantages without a considerable decline in model performance. +1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization) can achieve computational advantages without a considerable decline in model performance. 2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization. diff --git a/docs/source/en/model_doc/colpali.md b/docs/source/en/model_doc/colpali.md index 84d0e087b3..e9c4db5ef0 100644 --- a/docs/source/en/model_doc/colpali.md +++ b/docs/source/en/model_doc/colpali.md @@ -95,7 +95,7 @@ images = [ Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. -The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to int4. +The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to int4. ```python import requests diff --git a/docs/source/en/model_doc/colqwen2.md b/docs/source/en/model_doc/colqwen2.md index 8a1a4de6ce..654a6129a3 100644 --- a/docs/source/en/model_doc/colqwen2.md +++ b/docs/source/en/model_doc/colqwen2.md @@ -99,7 +99,7 @@ images = [ Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. -The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to int4. +The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to int4. ```python import requests diff --git a/docs/source/en/model_doc/csm.md b/docs/source/en/model_doc/csm.md index 833ddb697b..0abd7a73a5 100644 --- a/docs/source/en/model_doc/csm.md +++ b/docs/source/en/model_doc/csm.md @@ -21,7 +21,7 @@ rendered properly in your Markdown viewer. The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model [released by Sesame](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice). It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio. **Model Architecture:** -CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model [Mimi](./mimi.md), introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio. +CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model [Mimi](./mimi), introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio. The original csm-1b checkpoint is available under the [Sesame](https://huggingface.co/sesame/csm-1b) organization on Hugging Face. diff --git a/docs/source/en/model_doc/dia.md b/docs/source/en/model_doc/dia.md index a4a2f84c78..07a241efe9 100644 --- a/docs/source/en/model_doc/dia.md +++ b/docs/source/en/model_doc/dia.md @@ -26,14 +26,14 @@ rendered properly in your Markdown viewer. ## Overview -Dia is an opensource text-to-speech (TTS) model (1.6B parameters) developed by [Nari Labs](https://huggingface.co/nari-labs). -It can generate highly realistic dialogue from transcript including nonverbal communications such as laughter and coughing. +Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by [Nari Labs](https://huggingface.co/nari-labs). +It can generate highly realistic dialogue from transcript including non-verbal communications such as laughter and coughing. Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning). **Model Architecture:** Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as rotational positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is utilized while -for the audio portion (decoder), a pretrained codec model [DAC](./dac.md) is used - DAC encodes speech into discrete codebook +for the audio portion (decoder), a pretrained codec model [DAC](./dac) is used - DAC encodes speech into discrete codebook tokens and decodes them back into audio. ## Usage Tips diff --git a/docs/source/en/model_doc/ernie.md b/docs/source/en/model_doc/ernie.md index 8b2290ded6..4d076c95d4 100644 --- a/docs/source/en/model_doc/ernie.md +++ b/docs/source/en/model_doc/ernie.md @@ -27,7 +27,7 @@ rendered properly in your Markdown viewer. ERNIE (Enhanced Representation through kNowledge IntEgration) is designed to learn language representation enhanced by knowledge masking strategies, which includes entity-level masking and phrase-level masking. -Other ERNIE models released by baidu can be found at [Ernie 4.5](./ernie4_5.md), and [Ernie 4.5 MoE](./ernie4_5_moe.md). +Other ERNIE models released by baidu can be found at [Ernie 4.5](./ernie4_5), and [Ernie 4.5 MoE](./ernie4_5_moe). > [!TIP] > This model was contributed by [nghuyong](https://huggingface.co/nghuyong), and the official code can be found in [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) (in PaddlePaddle). diff --git a/docs/source/en/model_doc/ernie4_5.md b/docs/source/en/model_doc/ernie4_5.md index af24c0b8bf..c9f4c04356 100644 --- a/docs/source/en/model_doc/ernie4_5.md +++ b/docs/source/en/model_doc/ernie4_5.md @@ -29,9 +29,9 @@ rendered properly in your Markdown viewer. The Ernie 4.5 model was released in the [Ernie 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/) release by baidu. This family of models contains multiple different architectures and model sizes. This model in specific targets the base text -model without mixture of experts (moe) with 0.3B parameters in total. It uses the standard [Llama](./llama.md) at its core. +model without mixture of experts (moe) with 0.3B parameters in total. It uses the standard [Llama](./llama) at its core. -Other models from the family can be found at [Ernie 4.5 Moe](./ernie4_5_moe.md). +Other models from the family can be found at [Ernie 4.5 Moe](./ernie4_5_moe).
diff --git a/docs/source/en/model_doc/ernie4_5_moe.md b/docs/source/en/model_doc/ernie4_5_moe.md index 8c3e75593c..f16cfb1f92 100644 --- a/docs/source/en/model_doc/ernie4_5_moe.md +++ b/docs/source/en/model_doc/ernie4_5_moe.md @@ -30,10 +30,10 @@ rendered properly in your Markdown viewer. The Ernie 4.5 Moe model was released in the [Ernie 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/) release by baidu. This family of models contains multiple different architectures and model sizes. This model in specific targets the base text model with mixture of experts (moe) - one with 21B total, 3B active parameters and another one with 300B total, 47B active parameters. -It uses the standard [Llama](./llama.md) at its core combined with a specialized MoE based on [Mixtral](./mixtral.md) with additional shared +It uses the standard [Llama](./llama) at its core combined with a specialized MoE based on [Mixtral](./mixtral) with additional shared experts. -Other models from the family can be found at [Ernie 4.5](./ernie4_5.md). +Other models from the family can be found at [Ernie 4.5](./ernie4_5).
diff --git a/docs/source/en/model_doc/gemma3n.md b/docs/source/en/model_doc/gemma3n.md index 94558ae191..803940b6f2 100644 --- a/docs/source/en/model_doc/gemma3n.md +++ b/docs/source/en/model_doc/gemma3n.md @@ -30,7 +30,7 @@ Gemma3n is a multimodal model with pretrained and instruction-tuned variants, av large portions of the language model architecture are shared with prior Gemma releases, there are many new additions in this model, including [Alternating Updates][altup] (AltUp), [Learned Augmented Residual Layer][laurel] (LAuReL), [MatFormer][matformer], Per-Layer Embeddings (PLE), [Activation Sparsity with Statistical Top-k][spark-transformer], and KV cache sharing. The language model uses -a similar attention pattern to [Gemma 3](./gemma3.md) with alternating 4 local sliding window self-attention layers for +a similar attention pattern to [Gemma 3](./gemma3) with alternating 4 local sliding window self-attention layers for every global self-attention layer with a maximum context length of 32k tokens. Gemma 3n introduces [MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a newly trained audio encoder based on the [Universal Speech Model][usm] (USM) architecture. diff --git a/docs/source/en/model_doc/idefics2.md b/docs/source/en/model_doc/idefics2.md index 1bdb7a0b16..58e0dd0ecb 100644 --- a/docs/source/en/model_doc/idefics2.md +++ b/docs/source/en/model_doc/idefics2.md @@ -169,9 +169,9 @@ model = Idefics2ForConditionalGeneration.from_pretrained( ## Shrinking down Idefics2 using quantization -As the Idefics2 model has 8 billion parameters, that would require about 16GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), that requires only about 3.5GB of RAM. +As the Idefics2 model has 8 billion parameters, that would require about 16GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization). If the model is quantized to 4 bits (or half a byte per parameter), that requires only about 3.5GB of RAM. -Quantizing a model is as simple as passing a `quantization_config` to the model. One can change the code snippet above with the changes below. We'll leverage the BitsAndyBytes quantization (but refer to [this page](../quantization.md) for other quantization methods): +Quantizing a model is as simple as passing a `quantization_config` to the model. One can change the code snippet above with the changes below. We'll leverage the BitsAndyBytes quantization (but refer to [this page](../quantization) for other quantization methods): ```diff + from transformers import BitsAndBytesConfig @@ -193,7 +193,7 @@ model = Idefics2ForConditionalGeneration.from_pretrained( A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Idefics2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. -- A notebook on how to fine-tune Idefics2 on a custom dataset using the [Trainer](../main_classes/trainer.md) can be found [here](https://colab.research.google.com/drive/1NtcTgRbSBKN7pYD3Vdx1j9m8pt3fhFDB?usp=sharing). It supports both full fine-tuning as well as (quantized) LoRa. +- A notebook on how to fine-tune Idefics2 on a custom dataset using the [Trainer](../main_classes/trainer) can be found [here](https://colab.research.google.com/drive/1NtcTgRbSBKN7pYD3Vdx1j9m8pt3fhFDB?usp=sharing). It supports both full fine-tuning as well as (quantized) LoRa. - A script regarding how to fine-tune Idefics2 using the TRL library can be found [here](https://gist.github.com/edbeeching/228652fc6c2b29a1641be5a5778223cb). - Demo notebook regarding fine-tuning Idefics2 for JSON extraction use cases can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Idefics2). 🌎 diff --git a/docs/source/en/model_doc/mimi.md b/docs/source/en/model_doc/mimi.md index 6e68394fca..da4697fae8 100644 --- a/docs/source/en/model_doc/mimi.md +++ b/docs/source/en/model_doc/mimi.md @@ -30,7 +30,7 @@ The abstract from the paper is the following: *We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.* -Its architecture is based on [Encodec](model_doc/encodec) with several major differences: +Its architecture is based on [Encodec](./encodec) with several major differences: * it uses a much lower frame-rate. * it uses additional transformers for encoding and decoding for better latent contextualization * it uses a different quantization scheme: one codebook is dedicated to semantic projection. diff --git a/docs/source/en/model_doc/minimax.md b/docs/source/en/model_doc/minimax.md index d4b9e56f0b..258f3ff303 100644 --- a/docs/source/en/model_doc/minimax.md +++ b/docs/source/en/model_doc/minimax.md @@ -115,9 +115,9 @@ The Flash Attention-2 model uses also a more memory efficient cache slicing mech ## Shrinking down MiniMax using quantization -As the MiniMax model has 456 billion parameters, that would require about 912GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), about 228 GB of RAM is required. +As the MiniMax model has 456 billion parameters, that would require about 912GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization). If the model is quantized to 4 bits (or half a byte per parameter), about 228 GB of RAM is required. -Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization.md) for alternative quantization methods): +Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization) for alternative quantization methods): ```python >>> import torch diff --git a/docs/source/en/model_doc/mixtral.md b/docs/source/en/model_doc/mixtral.md index 8b07aff7fa..73172f3168 100644 --- a/docs/source/en/model_doc/mixtral.md +++ b/docs/source/en/model_doc/mixtral.md @@ -146,9 +146,9 @@ The Flash Attention-2 model uses also a more memory efficient cache slicing mech ## Shrinking down Mixtral using quantization -As the Mixtral model has 45 billion parameters, that would require about 90GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), a single A100 with 40GB of RAM is enough to fit the entire model, as in that case only about 27 GB of RAM is required. +As the Mixtral model has 45 billion parameters, that would require about 90GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization). If the model is quantized to 4 bits (or half a byte per parameter), a single A100 with 40GB of RAM is enough to fit the entire model, as in that case only about 27 GB of RAM is required. -Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization.md) for alternative quantization methods): +Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization) for alternative quantization methods): ```python >>> import torch diff --git a/docs/source/en/model_doc/patchtsmixer.md b/docs/source/en/model_doc/patchtsmixer.md index 3093206793..15573330a0 100644 --- a/docs/source/en/model_doc/patchtsmixer.md +++ b/docs/source/en/model_doc/patchtsmixer.md @@ -38,7 +38,7 @@ This model was contributed by [ajati](https://huggingface.co/ajati), [vijaye12]( ## Usage example -The code snippet below shows how to randomly initialize a PatchTSMixer model. The model is compatible with the [Trainer API](../trainer.md). +The code snippet below shows how to randomly initialize a PatchTSMixer model. The model is compatible with the [Trainer API](../trainer). ```python diff --git a/docs/source/en/model_doc/zamba.md b/docs/source/en/model_doc/zamba.md index a6a7ee38cf..b112c92d53 100644 --- a/docs/source/en/model_doc/zamba.md +++ b/docs/source/en/model_doc/zamba.md @@ -69,11 +69,11 @@ print(tokenizer.decode(outputs[0])) ## Model card The model cards can be found at: -* [Zamba-7B](MODEL_CARD_ZAMBA-7B-v1.md) +* [Zamba-7B](https://huggingface.co/Zyphra/Zamba-7B-v1) ## Issues -For issues with model output, or community discussion, please use the Hugging Face community [forum](https://huggingface.co/zyphra/zamba-7b) +For issues with model output, or community discussion, please use the Hugging Face community [forum](https://huggingface.co/Zyphra/Zamba-7B-v1/discussions) ## License diff --git a/docs/source/en/tasks/keypoint_detection.md b/docs/source/en/tasks/keypoint_detection.md index a0ec71a5c2..3a5871d01a 100644 --- a/docs/source/en/tasks/keypoint_detection.md +++ b/docs/source/en/tasks/keypoint_detection.md @@ -25,7 +25,7 @@ Keypoint detection identifies and locates specific points of interest within an In this guide, we will show how to extract keypoints from images. -For this tutorial, we will use [SuperPoint](./model_doc/superpoint.md), a foundation model for keypoint detection. +For this tutorial, we will use [SuperPoint](./model_doc/superpoint), a foundation model for keypoint detection. ```python from transformers import AutoImageProcessor, SuperPointForKeypointDetection diff --git a/docs/source/en/tasks/video_text_to_text.md b/docs/source/en/tasks/video_text_to_text.md index 4c10907e45..a7aafdf195 100644 --- a/docs/source/en/tasks/video_text_to_text.md +++ b/docs/source/en/tasks/video_text_to_text.md @@ -20,7 +20,7 @@ rendered properly in your Markdown viewer. Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning. -These models have nearly the same architecture as [image-text-to-text](../image_text_to_text.md) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `