[docs] Updates to TTS task guide with regards to the new TTS pipeline (#26095)
* tts guide updates with a pipeline * Apply suggestions from code review Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> * Update docs/source/en/tasks/text-to-speech.md Co-authored-by: Vaibhav Srivastav <vaibhavs10@gmail.com> --------- Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> Co-authored-by: Vaibhav Srivastav <vaibhavs10@gmail.com>
This commit is contained in:
@@ -19,16 +19,40 @@ rendered properly in your Markdown viewer.
|
|||||||
[[open-in-colab]]
|
[[open-in-colab]]
|
||||||
|
|
||||||
Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple
|
Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple
|
||||||
languages and for multiple speakers. The only text-to-speech model currently available in 🤗 Transformers
|
languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as
|
||||||
is [SpeechT5](model_doc/speecht5), though more will be added in the future. SpeechT5 is pre-trained on a combination of
|
[Bark](../model_doc/bark), [MMS](../model_doc/mms), [VITS](../model_doc/vits) and [SpeechT5](../model_doc/speecht5).
|
||||||
|
|
||||||
|
You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Bark,
|
||||||
|
can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
|
||||||
|
Here's an example of how you would use the `"text-to-speech"` pipeline with Bark:
|
||||||
|
|
||||||
|
```py
|
||||||
|
>>> from transformers import pipeline
|
||||||
|
|
||||||
|
>>> pipe = pipeline("text-to-speech", model="suno/bark-small")
|
||||||
|
>>> text = "[clears throat] This is a test ... and I just took a long pause."
|
||||||
|
>>> output = pipe(text)
|
||||||
|
```
|
||||||
|
|
||||||
|
Here's a code snippet you can use to listen to the resulting audio in a notebook:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from IPython.display import Audio
|
||||||
|
>>> Audio(output["audio"], rate=output["sampling_rate"])
|
||||||
|
```
|
||||||
|
|
||||||
|
For more examples on what Bark and other pretrained TTS models can do, refer to our
|
||||||
|
[Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
|
||||||
|
|
||||||
|
If you are looking to fine-tune a TTS model, you can currently fine-tune SpeechT5 only. SpeechT5 is pre-trained on a combination of
|
||||||
speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text
|
speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text
|
||||||
and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5
|
and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5
|
||||||
supports multiple speakers through x-vector speaker embeddings.
|
supports multiple speakers through x-vector speaker embeddings.
|
||||||
|
|
||||||
This guide illustrates how to:
|
The remainder of this guide illustrates how to:
|
||||||
|
|
||||||
1. Fine-tune [SpeechT5](model_doc/speecht5) that was originally trained on English speech on the Dutch (`nl`) language subset of the [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) dataset.
|
1. Fine-tune [SpeechT5](../model_doc/speecht5) that was originally trained on English speech on the Dutch (`nl`) language subset of the [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) dataset.
|
||||||
2. Use your fine-tuned model for inference.
|
2. Use your refined model for inference in one of two ways: using a pipeline or directly.
|
||||||
|
|
||||||
Before you begin, make sure you have all the necessary libraries installed:
|
Before you begin, make sure you have all the necessary libraries installed:
|
||||||
|
|
||||||
@@ -485,6 +509,12 @@ the `per_device_train_batch_size` incrementally by factors of 2 and increase `gr
|
|||||||
>>> trainer.train()
|
>>> trainer.train()
|
||||||
```
|
```
|
||||||
|
|
||||||
|
To be able to use your checkpoint with a pipeline, make sure to save the processor with the checkpoint:
|
||||||
|
|
||||||
|
```py
|
||||||
|
>>> processor.save_pretrained("YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")
|
||||||
|
```
|
||||||
|
|
||||||
Push the final model to the 🤗 Hub:
|
Push the final model to the 🤗 Hub:
|
||||||
|
|
||||||
```py
|
```py
|
||||||
@@ -493,29 +523,70 @@ Push the final model to the 🤗 Hub:
|
|||||||
|
|
||||||
## Inference
|
## Inference
|
||||||
|
|
||||||
|
### Inference with a pipeline
|
||||||
|
|
||||||
Great, now that you've fine-tuned a model, you can use it for inference!
|
Great, now that you've fine-tuned a model, you can use it for inference!
|
||||||
Load the model from the 🤗 Hub (make sure to use your account name in the following code snippet):
|
First, let's see how you can use it with a corresponding pipeline. Let's create a `"text-to-speech"` pipeline with your
|
||||||
|
checkpoint:
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> model = SpeechT5ForTextToSpeech.from_pretrained("YOUR_ACCOUNT/speecht5_finetuned_voxpopuli_nl")
|
>>> from transformers import pipeline
|
||||||
|
|
||||||
|
>>> pipe = pipeline("text-to-speech", model="YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")
|
||||||
```
|
```
|
||||||
|
|
||||||
Pick an example, here we'll take one from the test dataset. Obtain a speaker embedding.
|
Pick a piece of text in Dutch you'd like narrated, e.g.:
|
||||||
|
|
||||||
|
```py
|
||||||
|
>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"
|
||||||
|
```
|
||||||
|
|
||||||
|
To use SpeechT5 with the pipeline, you'll need a speaker embedding. Let's get it from an example in the test dataset:
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> example = dataset["test"][304]
|
>>> example = dataset["test"][304]
|
||||||
>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)
|
>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)
|
||||||
```
|
```
|
||||||
|
|
||||||
Define some input text and tokenize it.
|
Now you can pass the text and speaker embeddings to the pipeline, and it will take care of the rest:
|
||||||
|
|
||||||
|
```py
|
||||||
|
>>> forward_params = {"speaker_embeddings": speaker_embeddings}
|
||||||
|
>>> output = pipe(text, forward_params=forward_params)
|
||||||
|
>>> output
|
||||||
|
{'audio': array([-6.82714235e-05, -4.26525949e-04, 1.06134125e-04, ...,
|
||||||
|
-1.22392643e-03, -7.76011671e-04, 3.29112721e-04], dtype=float32),
|
||||||
|
'sampling_rate': 16000}
|
||||||
|
```
|
||||||
|
|
||||||
|
You can then listen to the result:
|
||||||
|
|
||||||
|
```py
|
||||||
|
>>> from IPython.display import Audio
|
||||||
|
>>> Audio(output['audio'], rate=output['sampling_rate'])
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run inference manually
|
||||||
|
|
||||||
|
You can achieve the same inference results without using the pipeline, however, more steps will be required.
|
||||||
|
|
||||||
|
Load the model from the 🤗 Hub:
|
||||||
|
|
||||||
|
```py
|
||||||
|
>>> model = SpeechT5ForTextToSpeech.from_pretrained("YOUR_ACCOUNT/speecht5_finetuned_voxpopuli_nl")
|
||||||
|
```
|
||||||
|
|
||||||
|
Pick an example from the test dataset obtain a speaker embedding.
|
||||||
|
|
||||||
|
```py
|
||||||
|
>>> example = dataset["test"][304]
|
||||||
|
>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)
|
||||||
|
```
|
||||||
|
|
||||||
|
Define the input text and tokenize it.
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"
|
>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"
|
||||||
```
|
|
||||||
|
|
||||||
Preprocess the input text:
|
|
||||||
|
|
||||||
```py
|
|
||||||
>>> inputs = processor(text=text, return_tensors="pt")
|
>>> inputs = processor(text=text, return_tensors="pt")
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user