add VITS model (#24085)
* add VITS model * let's vits * finish TextEncoder (mostly) * rename VITS to Vits * add StochasticDurationPredictor * ads flow model * add generator * correctly set vocab size * add tokenizer * remove processor & feature extractor * add PosteriorEncoder * add missing weights to SDP * also convert LJSpeech and VCTK checkpoints * add training stuff in forward * add placeholder tests for tokenizer * add placeholder tests for model * starting cleanup * let the great renaming begin! * use config * global_conditioning * more cleaning * renaming variables * more renaming * more renaming * it never ends * reticulating the splines * more renaming * HiFi-GAN * doc strings for main model * fixup * fix-copies * don't make it a PreTrainedModel * fixup * rename config options * remove training logic from forward pass * simplify relative position * use actual checkpoint * style * PR review fixes * more review changes * fixup * more unit tests * fixup * fix doc test * add integration test * improve tokenizer tests * add tokenizer integration test * fix tests on GPU (gave OOM) * conversion script can handle repos from hub * add conversion script for all MMS-TTS checkpoints * automatically create a README for the converted checkpoint * small changes to config * push README to hub * only show uroman note for checkpoints that need it * remove conversion script because code formatting breaks the readme * make WaveNet layers configurable * rename variables * simplifying the math * output attentions and hidden states * remove VitsFlip in flow model * also got rid of the other flip * fix tests * rename more variables * rename tokenizer, add phonemization * raise error when phonemizer missing * re-order config docstrings to match method * change config naming * remove redundant str -> list * fix copyright: vits authors -> kakao enterprise * (mean, log_variances) -> (prior_mean, prior_log_variances) * if return dict -> if not return dict * speed -> speaking rate * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * update fused tanh sigmoid * reduce dims in tester * audio -> output_values * audio -> output_values in tuple out * fix return type * fix return type * make _unconstrained_rational_quadratic_spline a function * all nn's to accept a config * add spectro to output * move {speaking rate, noise scale, noise scale duration} to config * path -> attn_path * idxs -> valid idxs -> padded idxs * output values -> waveform * use config for attention * make generation work * harden integration test * add spectrogram to dict output * tokenizer refactor * make style * remove 'fake' padding token * harden tokenizer tests * ron norm test * fprop / save tests deterministic * move uroman to tokenizer as much as possible * better logger message * fix vivit imports * add uroman integration test * make style * up * matthijs -> sanchit-gandhi * fix tokenizer test * make fix-copies * fix dict comprehension * fix config tests * fix model tests * make outputs consistent with reverse/not reverse * fix key concat * more model details * add author * return dict * speaker error * labels error * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/vits/convert_original_checkpoint.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * remove uromanize * add docstrings * add docstrings for tokenizer * upper-case skip messages * fix return dict * style * finish tests * update checkpoints * make style * remove doctest file * revert * fix docstring * fix tokenizer * remove uroman integration test * add sampling rate * fix docs / docstrings * style * add sr to model output * fix outputs * style / copies * fix docstring * fix copies * remove sr from model outputs * Update utils/documentation_tests.txt Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * add sr as allowed attr --------- Co-authored-by: sanchit-gandhi <sanchit@huggingface.co> Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
This commit is contained in:
committed by
GitHub
parent
ef10dbce5c
commit
4ece3b9433
@@ -603,6 +603,8 @@
|
||||
title: UniSpeech
|
||||
- local: model_doc/unispeech-sat
|
||||
title: UniSpeech-SAT
|
||||
- local: model_doc/vits
|
||||
title: VITS
|
||||
- local: model_doc/wav2vec2
|
||||
title: Wav2Vec2
|
||||
- local: model_doc/wav2vec2-conformer
|
||||
|
||||
@@ -255,6 +255,7 @@ The documentation is organized into five sections:
|
||||
1. **[VitDet](model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
|
||||
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
|
||||
1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
|
||||
1. **[VITS](model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
|
||||
1. **[ViViT](model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
|
||||
1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
|
||||
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
|
||||
@@ -475,6 +476,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| VitDet | ✅ | ❌ | ❌ |
|
||||
| ViTMAE | ✅ | ✅ | ❌ |
|
||||
| ViTMSN | ✅ | ❌ | ❌ |
|
||||
| VITS | ✅ | ❌ | ❌ |
|
||||
| ViViT | ✅ | ❌ | ❌ |
|
||||
| Wav2Vec2 | ✅ | ✅ | ✅ |
|
||||
| Wav2Vec2-Conformer | ✅ | ❌ | ❌ |
|
||||
|
||||
114
docs/source/en/model_doc/vits.md
Normal file
114
docs/source/en/model_doc/vits.md
Normal file
@@ -0,0 +1,114 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# VITS
|
||||
|
||||
## Overview
|
||||
|
||||
The VITS model was proposed in [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
|
||||
|
||||
|
||||
VITS (**V**ariational **I**nference with adversarial learning for end-to-end **T**ext-to-**S**peech) is an end-to-end
|
||||
speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational
|
||||
autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior.
|
||||
|
||||
A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based
|
||||
text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers,
|
||||
much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text
|
||||
input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to
|
||||
synthesise speech with different rhythms from the same input text.
|
||||
|
||||
The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training.
|
||||
To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During
|
||||
inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the
|
||||
waveform using a cascade of the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor,
|
||||
the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.*
|
||||
|
||||
This model can also be used with TTS checkpoints from [Massively Multilingual Speech (MMS)](https://arxiv.org/abs/2305.13516)
|
||||
as these checkpoints use the same architecture and a slightly modified tokenizer.
|
||||
|
||||
This model was contributed by [Matthijs](https://huggingface.co/Matthijs) and [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original code can be found [here](https://github.com/jaywalnut310/vits).
|
||||
|
||||
## Model Usage
|
||||
|
||||
Both the VITS and MMS-TTS checkpoints can be used with the same API. Since the flow-based model is non-deterministic, it
|
||||
is good practice to set a seed to ensure reproducibility of the outputs. For languages with a Roman alphabet,
|
||||
such as English or French, the tokenizer can be used directly to pre-process the text inputs. The following code example
|
||||
runs a forward pass using the MMS-TTS English checkpoint:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import VitsTokenizer, VitsModel, set_seed
|
||||
|
||||
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
|
||||
model = VitsModel.from_pretrained("facebook/mms-tts-eng")
|
||||
|
||||
inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")
|
||||
|
||||
set_seed(555) # make deterministic
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
waveform = outputs.waveform[0]
|
||||
```
|
||||
|
||||
The resulting waveform can be saved as a `.wav` file:
|
||||
|
||||
```python
|
||||
import scipy
|
||||
|
||||
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=waveform)
|
||||
```
|
||||
|
||||
Or displayed in a Jupyter Notebook / Google Colab:
|
||||
|
||||
```python
|
||||
from IPython.display import Audio
|
||||
|
||||
Audio(waveform, rate=model.config.sampling_rate)
|
||||
```
|
||||
|
||||
For certain languages with a non-Roman alphabet, such as Arabic, Mandarin or Hindi, the [`uroman`](https://github.com/isi-nlp/uroman)
|
||||
perl package is required to pre-process the text inputs to the Roman alphabet.
|
||||
|
||||
You can check whether you require the `uroman` package for your language by inspecting the `is_uroman` attribute of
|
||||
the pre-trained `tokenizer`:
|
||||
|
||||
```python
|
||||
from transformers import VitsTokenizer
|
||||
|
||||
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
|
||||
print(tokenizer.is_uroman)
|
||||
```
|
||||
|
||||
If required, you should apply the uroman package to your text inputs **prior** to passing them to the `VitsTokenizer`,
|
||||
since currently the tokenizer does not support performing the pre-processing itself.
|
||||
|
||||
## VitsConfig
|
||||
|
||||
[[autodoc]] VitsConfig
|
||||
|
||||
## VitsTokenizer
|
||||
|
||||
[[autodoc]] VitsTokenizer
|
||||
- __call__
|
||||
- save_vocabulary
|
||||
|
||||
## VitsModel
|
||||
|
||||
[[autodoc]] VitsModel
|
||||
- forward
|
||||
Reference in New Issue
Block a user