[MMS] Scaling Speech Technology to 1,000+ Languages | Add attention adapter to Wav2Vec2 (#23813)
* add fine-tuned with adapter layer
* Add set_target_lang to tokenizer
* Implement load adapter
* add tests
* make style
* Apply suggestions from code review
* Update src/transformers/models/wav2vec2/tokenization_wav2vec2.py
* make fix-copies
* Apply suggestions from code review
* make fix-copies
* make style again
* mkae style again
* fix doc string
* Update tests/models/wav2vec2/test_tokenization_wav2vec2.py
* Apply suggestions from code review
* fix
* Correct wav2vec2 adapter
* mkae style
* Update src/transformers/models/wav2vec2/modeling_wav2vec2.py
  Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
* add more nice docs
* finish
* finish
* Apply suggestions from code review
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Apply suggestions from code review
* all finish

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
parent
f49a3453ca
commit
5dfd407b37
118
docs/source/en/model_doc/mms.mdx
Normal file
@@ -0,0 +1,118 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# MMS

## Overview

The MMS model was proposed in [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516)
by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli.

The abstract from the paper is the following:

*Expanding the language coverage of speech technology has the potential to improve access to information for many more people.
However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000
languages spoken around the world.
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging
self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages,
a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models
for the same number of languages, as well as a language identification model for 4,017 languages.
Experiments show that our multilingual speech recognition model more than halves the word error rate of
Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.*

Tips:

- MMS is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. The raw waveform should be pre-processed with [`Wav2Vec2FeatureExtractor`].
- The MMS model was trained using connectionist temporal classification (CTC), so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`].
- MMS can load different language adapter weights for different languages via [`~Wav2Vec2PreTrainedModel.load_adapter`]. Language adapters consist of only roughly 2 million parameters and can therefore be efficiently loaded on the fly when needed.

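The CTC decoding step mentioned in the tips can be illustrated without any model. The following is a minimal sketch of greedy CTC decoding, assuming a toy vocabulary and a blank token at index 0 (both hypothetical; the real mapping and blank token come from [`Wav2Vec2CTCTokenizer`]). Consecutive repeated predictions are merged first, then blank tokens are dropped.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop blank tokens."""
    # groupby merges runs of identical consecutive ids into a single id
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; a real one is provided by the tokenizer.
toy_vocab = {0: "<pad>", 1: "h", 2: "i", 3: "l", 4: "o", 5: "e"}

# One predicted id per audio frame; the blank between the two runs of 3
# is what allows the double "l" to survive the collapse step.
frames = [1, 1, 0, 5, 3, 3, 0, 3, 4, 4, 0]
print(ctc_greedy_decode(frames, toy_vocab))  # hello
```

In practice `processor.decode` performs this step, so the sketch is only meant to show why the raw per-frame predictions cannot be read as text directly.
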
Relevant checkpoints can be found under https://huggingface.co/models?other=mms.

MMS's architecture is based on the Wav2Vec2 model, so one can refer to [Wav2Vec2's documentation page](wav2vec2).

The original code can be found [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).

## Inference

By default, MMS loads adapter weights for English, but these can easily be switched out for another language.
Let's look at an example.

First, we load audio data in different languages using the [Datasets](https://github.com/huggingface/datasets) library.

```py
from datasets import load_dataset, Audio

# English: stream the test split so the full dataset is not downloaded
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
# Decode the audio at 16 kHz, the sampling rate passed to the processor below
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# French
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
fr_sample = next(iter(stream_data))["audio"]["array"]
```

Next, we load the model and processor:

```py
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch

model_id = "facebook/mms-1b-all"

processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
```

Now we process the audio data, pass the processed audio data to the model, and transcribe the model output,
just as we usually do for [`Wav2Vec2ForCTC`].

```py
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
# 'joe keton disapproved of films and buster also had reservations about the media'
```

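The logits computed above hold one row of vocabulary scores per output frame, not per audio sample. As a rough illustration, the sketch below estimates how many frames a waveform of a given length produces. It assumes the convolutional feature encoder uses the kernel sizes and strides of the default Wav2Vec2 configuration, which is an assumption and not something stated on this page; the helper name `num_output_frames` is hypothetical.

```python
# Assumed defaults of the Wav2Vec2 convolutional feature encoder. For a real
# checkpoint, read `model.config.conv_kernel` and `model.config.conv_stride`.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Apply the unpadded 1D convolution length formula once per layer."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


# One second of audio sampled at 16 kHz
print(num_output_frames(16_000))  # 49
```

Under these assumptions the model emits roughly one prediction every 20 ms (the strides multiply to 320 samples), which is why the argmax ids contain many repeats and blanks before decoding.
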
We can now keep the same model in memory and simply switch out the language adapters by
calling the convenient [`~Wav2Vec2ForCTC.load_adapter`] function for the model and [`~Wav2Vec2CTCTokenizer.set_target_lang`] for the tokenizer.
We pass the target language as an input, for example `"fra"` for French.

```py
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
# "ce dernier est volé tout au long de l'histoire romaine"
```

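Because the tokenizer and the model have to be switched together, it can be convenient to wrap both calls in one place. The sketch below is one possible way to do that, not part of the library: the class name `AdapterSwitcher` is hypothetical, and it only relies on the two methods used above, `set_target_lang` and `load_adapter`. It also remembers the active language so that adapter weights are not reloaded when the language has not changed.

```python
class AdapterSwitcher:
    """Keep tokenizer and model adapters in sync and skip redundant reloads."""

    def __init__(self, model, tokenizer, current_lang=None):
        self.model = model
        self.tokenizer = tokenizer
        # None means "unknown", so the first call to `use` always loads weights
        self.current_lang = current_lang

    def use(self, lang):
        """Switch to `lang`; return True if adapter weights were loaded."""
        if lang == self.current_lang:
            return False
        self.tokenizer.set_target_lang(lang)
        self.model.load_adapter(lang)
        self.current_lang = lang
        return True


# Intended use with the objects created earlier in this section:
# switcher = AdapterSwitcher(model, processor.tokenizer)
# switcher.use("fra")
```
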
In the same way, the language can be switched out for any other supported language. To see all supported languages, have a look at:

```py
processor.tokenizer.vocab.keys()
```
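
It can be convenient to validate a language code before switching to it. The sketch below takes any collection of language codes, such as the keys of the tokenizer vocabulary, and reports whether a code is available, along with close matches if it is not. The function name `check_language` and the toy code list are hypothetical; with a real checkpoint one would pass `processor.tokenizer.vocab.keys()` instead.

```python
from difflib import get_close_matches


def check_language(lang, supported_langs):
    """Return (is_supported, suggestions) for a requested language code."""
    supported = sorted(supported_langs)
    if lang in supported:
        return True, []
    # Suggest up to three similarly spelled codes to help catch typos
    return False, get_close_matches(lang, supported, n=3, cutoff=0.6)


# Hypothetical stand-in for processor.tokenizer.vocab.keys()
toy_langs = ["eng", "fra", "deu", "spa"]

print(check_language("fra", toy_langs))
print(check_language("fre", toy_langs))
```
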