From 504cd71a6b172f177e6da513bea94fadb18ad99c Mon Sep 17 00:00:00 2001
From: Akash Mahajan <akash7190@gmail.com>
Date: Thu, 13 Oct 2022 01:39:03 -0700
Subject: [PATCH] add a note to whisper docs clarifying support of long-form
 decoding (#19497)

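Until long-form decoding lands, callers have to pre-segment audio
themselves. A minimal sketch of one way to do that, assuming a 1-D
sequence of samples at 16 kHz; `split_into_segments` is a hypothetical
helper for illustration and is not part of transformers:

```python
SAMPLING_RATE = 16_000  # assumption: audio already resampled to 16 kHz
MAX_SECONDS = 30        # short-form limit described in the added note


def split_into_segments(samples, sampling_rate=SAMPLING_RATE, max_seconds=MAX_SECONDS):
    """Split a 1-D sequence of samples into consecutive chunks of at most max_seconds."""
    if sampling_rate <= 0 or max_seconds <= 0:
        raise ValueError("sampling_rate and max_seconds must be positive")
    chunk = sampling_rate * max_seconds  # samples per segment
    return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]


# 70 s of silence is cut into two full segments plus a shorter remainder.
audio = [0.0] * (70 * SAMPLING_RATE)
segments = split_into_segments(audio)
segment_seconds = [len(s) / SAMPLING_RATE for s in segments]
```

Each segment can then be prepared with WhisperProcessor and decoded
separately. Fixed-window cuts may split a word at a boundary, so this
is only a stopgap, not a substitute for real long-form decoding.
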
---
 docs/source/en/model_doc/whisper.mdx | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/en/model_doc/whisper.mdx b/docs/source/en/model_doc/whisper.mdx
index 40485337cd..6e88651d7e 100644
--- a/docs/source/en/model_doc/whisper.mdx
+++ b/docs/source/en/model_doc/whisper.mdx
@@ -25,6 +25,7 @@ Tips:
 
 - The model usually performs well without requiring any finetuning.
 - The architecture follows a classic encoder-decoder architecture, which means that it relies on the [`~generation_utils.GenerationMixin.generate`] function for inference.
+- Inference is currently implemented only for short-form audio, i.e. audio that has been pre-segmented into segments of at most 30 seconds. Long-form inference (including timestamps) will be implemented in a future release.
 - One can use [`WhisperProcessor`] to prepare audio for the model, and decode the predicted ID's back into text.
 
 This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).