Add FastSpeech2Conformer (#23439)
* start - docs, SpeechT5 copy and rename * add relevant code from FastSpeech2 draft, have tests pass * make it an actual conformer, demo ex. * matching inference with original repo, includes debug code * refactor nn.Sequentials, start more desc. var names * more renaming * more renaming * vocoder scratchwork * matching vocoder outputs * hifigan vocoder conversion script * convert model script, rename some config vars * replace postnet with speecht5's implementation * passing common tests, file cleanup * expand testing, add output hidden states and attention * tokenizer + passing tokenizer tests * variety of updates and tests * g2p_en pckg setup * import structure edits * docstrings and cleanup * repo consistency * deps * small cleanup * forward signature param order * address comments except for masks and labels * address comments on attention_mask and labels * address second round of comments * remove old unneeded line * address comments part 1 * address comments pt 2 * rename auto mapping * fixes for failing tests * address comments part 3 (bart-like, train loss) * make style * pass config where possible * add forward method + tests to WithHifiGan model * make style * address arg passing and generate_speech comments * address Arthur comments * address Arthur comments pt2 * lint changes * Sanchit comment * add g2p-en to doctest deps * move up self.encoder * onnx compatible tensor method * fix is symbolic * fix paper url * move models to espnet org * make style * make fix-copies * update docstring * Arthur comments * update docstring w/ new updates * add model architecture images * header size * md wording update * make style
This commit is contained in:
@@ -44,10 +44,8 @@ Here's a code snippet you can use to listen to the resulting audio in a notebook
|
||||
For more examples on what Bark and other pretrained TTS models can do, refer to our
|
||||
[Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
|
||||
|
||||
If you are looking to fine-tune a TTS model, you can currently fine-tune SpeechT5 only. SpeechT5 is pre-trained on a combination of
|
||||
speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text
|
||||
and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5
|
||||
supports multiple speakers through x-vector speaker embeddings.
|
||||
If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers
|
||||
are [SpeechT5](model_doc/speecht5) and [FastSpeech2Conformer](model_doc/fastspeech2_conformer), though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.
|
||||
|
||||
The remainder of this guide illustrates how to:
|
||||
|
||||
|
||||
Reference in New Issue
Block a user