From df017c3ccc7355c81c0200b31709f36f0a455310 Mon Sep 17 00:00:00 2001
From: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Date: Mon, 24 Apr 2023 14:00:29 +0200
Subject: [PATCH] [CLAP] Doc nits (#22957)

clap nits
---
 docs/source/en/model_doc/clap.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/model_doc/clap.mdx b/docs/source/en/model_doc/clap.mdx
index fa9abacbaf..2074934deb 100644
--- a/docs/source/en/model_doc/clap.mdx
+++ b/docs/source/en/model_doc/clap.mdx
@@ -14,10 +14,10 @@ specific language governing permissions and limitations under the License.
 
 ## Overview
 
-The CLAP model was proposed in [Large Scale Constrastive Laungaue-Audio pretraining with
+The CLAP model was proposed in [Large Scale Contrastive Language-Audio pretraining with
  feature fusion and keyword-to-caption augmentation](https://arxiv.org/pdf/2211.06687.pdf) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 
-CLAP (Constrastive Laungaue-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similar score.
+CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similar score.
 
 The abstract from the paper is the following:
 