Add Moonshine (#34784)
* config draft * full encoder forward * full decoder forward * fix sdpa and FA2 * fix sdpa and FA2 * moonshine model * moonshine model forward * fix attention with past_key_values * add MoonshineForConditionalGeneration * fix cache handling and causality for cross attention * no causal attention mask for the encoder * model addition (imports etc) * small nit * nits * Update src/transformers/models/moonshine/convert_usefulsensors_to_hf.py Co-authored-by: Joshua Lochner <admin@xenova.com> * add rope_theta * nits * model doc * Update src/transformers/models/auto/configuration_auto.py Co-authored-by: Joshua Lochner <admin@xenova.com> * imports * add MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES * updates modular * make * make fix-copies * ruff check examples fix * fix check_modular_conversion * nit * nits * nits * copied from -> imports * imports fix * integrate attention refacto * modular edge case * remove encoder * convolutions params in config * run modular_model_converter * make * Update docs/source/en/model_doc/moonshine.md Co-authored-by: Joshua Lochner <admin@xenova.com> * MoonshineModelTest * correct typo * make style * integration tests * make * modular convert * name conversion update (up_proj -> fc1 etc) * update config * update MLP * update attention * update encoder layer * update decoder layer * update convolutions parameters * update encoder * remove INPUTS_DOCSTRING * update decoder * update conditional generation * update pretrained model * imports * modular converted * update doc * fix * typo * update doc * update license * update init * split config in file * two classes for MLP * attention from GLM * from GlmRotaryEmbedding * split MLP * apply arthur's review suggestions * apply arthur's review suggestions * apply arthur's review suggestions * auto feature extractor * convert modular * fix + make * convert modular * make * unsplit config * use correct checkpoint * wrap generate * update tests * typos * make * typo * update doc --------- Co-authored-by: Joshua Lochner <admin@xenova.com>
This commit is contained in:
@@ -505,7 +505,9 @@
|
||||
- local: model_doc/mobilebert
|
||||
title: MobileBERT
|
||||
- local: model_doc/modernbert
|
||||
title: ModernBERT
|
||||
title: ModernBert
|
||||
- local: model_doc/moonshine
|
||||
title: moonshine
|
||||
- local: model_doc/mpnet
|
||||
title: MPNet
|
||||
- local: model_doc/mpt
|
||||
|
||||
@@ -235,6 +235,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| [MobileViT](model_doc/mobilevit) | ✅ | ✅ | ❌ |
|
||||
| [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ |
|
||||
| [ModernBERT](model_doc/modernbert) | ✅ | ❌ | ❌ |
|
||||
| [Moonshine](model_doc/moonshine) | ✅ | ❌ | ❌ |
|
||||
| [Moshi](model_doc/moshi) | ✅ | ❌ | ❌ |
|
||||
| [MPNet](model_doc/mpnet) | ✅ | ✅ | ❌ |
|
||||
| [MPT](model_doc/mpt) | ✅ | ❌ | ❌ |
|
||||
|
||||
56
docs/source/en/model_doc/moonshine.md
Normal file
56
docs/source/en/model_doc/moonshine.md
Normal file
@@ -0,0 +1,56 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Moonshine
|
||||
|
||||
## Overview
|
||||
|
||||
The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands
|
||||
](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny-en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.*
|
||||
|
||||
Tips:
|
||||
|
||||
- Moonshine improves upon Whisper's architecture:
|
||||
1. It uses SwiGLU activation instead of GELU in the decoder layers
|
||||
2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows.
|
||||
|
||||
This model was contributed by [Eustache Le Bihan (eustlb)](https://huggingface.co/eustlb).
|
||||
The original code can be found [here](https://github.com/usefulsensors/moonshine).
|
||||
|
||||
## Resources
|
||||
|
||||
- [Automatic speech recognition task guide](../tasks/asr)
|
||||
|
||||
## MoonshineConfig
|
||||
|
||||
[[autodoc]] MoonshineConfig
|
||||
|
||||
## MoonshineModel
|
||||
|
||||
[[autodoc]] MoonshineModel
|
||||
- forward
|
||||
- _mask_input_features
|
||||
|
||||
## MoonshineForConditionalGeneration
|
||||
|
||||
[[autodoc]] MoonshineForConditionalGeneration
|
||||
- forward
|
||||
- generate
|
||||
|
||||
@@ -68,6 +68,7 @@ FlashAttention-2 is currently supported for the following architectures:
|
||||
* [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
|
||||
* [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video)
|
||||
* [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision)
|
||||
* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
|
||||
* [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)
|
||||
* [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)
|
||||
* [VideoLlava](https://huggingface.co/docs/transformers/model_doc/video_llava)
|
||||
@@ -265,6 +266,7 @@ For now, Transformers supports SDPA inference and training for the following arc
|
||||
* [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video)
|
||||
* [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision)
|
||||
* [M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100#transformers.M2M100Model)
|
||||
* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
|
||||
* [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)
|
||||
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
|
||||
* [Mllama](https://huggingface.co/docs/transformers/model_doc/mllama#transformers.MllamaForConditionalGeneration)
|
||||
@@ -283,8 +285,8 @@ For now, Transformers supports SDPA inference and training for the following arc
|
||||
* [Phi3](https://huggingface.co/docs/transformers/model_doc/phi3#transformers.Phi3Model)
|
||||
* [PhiMoE](https://huggingface.co/docs/transformers/model_doc/phimoe#transformers.PhimoeModel)
|
||||
* [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel)
|
||||
* [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel)
|
||||
* [mBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel)
|
||||
* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
|
||||
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
|
||||
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
|
||||
* [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel)
|
||||
|
||||
Reference in New Issue
Block a user