From 504cd71a6b172f177e6da513bea94fadb18ad99c Mon Sep 17 00:00:00 2001
From: Akash Mahajan
Date: Thu, 13 Oct 2022 01:39:03 -0700
Subject: [PATCH] add a note to whisper docs clarifying support of long-form decoding (#19497)

---
 docs/source/en/model_doc/whisper.mdx | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/en/model_doc/whisper.mdx b/docs/source/en/model_doc/whisper.mdx
index 40485337cd..6e88651d7e 100644
--- a/docs/source/en/model_doc/whisper.mdx
+++ b/docs/source/en/model_doc/whisper.mdx
@@ -25,6 +25,7 @@ Tips:
 
 - The model usually performs well without requiring any finetuning.
 - The architecture follows a classic encoder-decoder architecture, which means that it relies on the [`~generation_utils.GenerationMixin.generate`] function for inference.
+- Inference is currently only implemented for short-form i.e. audio is pre-segmented into <=30s segments. Long-form (including timestamps) will be implemented in a future release.
 - One can use [`WhisperProcessor`] to prepare audio for the model, and decode the predicted ID's back into text.
 
 This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
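
The note added by this patch says audio must be pre-segmented into segments of at most 30 s before inference. A minimal sketch of one way a caller might do that is shown below. It is not part of the patch or of the library: the helper `split_into_segments` is hypothetical, the 16 kHz sampling rate is an assumption, and the waveform is taken to be a plain mono sequence of samples.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000      # assumed sampling rate in Hz (not stated in the patch)
MAX_SEGMENT_SECONDS = 30  # short-form limit stated in the added note


def split_into_segments(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    max_seconds: int = MAX_SEGMENT_SECONDS,
) -> List[Sequence[float]]:
    """Hypothetical helper: cut a mono waveform into consecutive chunks,
    each at most `max_seconds` long. The last chunk may be shorter."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    max_len = sample_rate * max_seconds
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]


# 70 s of silence is split into chunks of 30 s, 30 s and 10 s.
waveform = [0.0] * (70 * SAMPLE_RATE)
segments = split_into_segments(waveform)
print([len(s) / SAMPLE_RATE for s in segments])  # prints [30.0, 30.0, 10.0]
```

Each resulting segment could then be prepared with `WhisperProcessor` and decoded with `generate`, as the surrounding tips describe. A fixed-length split like this can cut a word in half at a boundary, so splitting on silence may be preferable in practice.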