Update audio examples with MInDS-14 (#16633)
* ✨ update audio examples with minds dataset
* 🖍 make style
* 🖍 minor fixes for doctests
This commit is contained in:
@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
Automatic speech recognition (ASR) converts a speech signal to text. It is an example of a sequence-to-sequence task, going from a sequence of audio inputs to textual outputs. Voice assistants like Siri and Alexa utilize ASR models to assist users.
|
||||
|
||||
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [TIMIT](https://huggingface.co/datasets/timit_asr) dataset to transcribe audio to text.
|
||||
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to transcribe audio to text.
|
||||
|
||||
<Tip>
|
||||
|
||||
@@ -24,50 +24,54 @@ See the automatic speech recognition [task page](https://huggingface.co/tasks/au
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load TIMIT dataset
|
||||
## Load MInDS-14 dataset
|
||||
|
||||
Load the TIMIT dataset from the 🤗 Datasets library:
|
||||
Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
>>> from datasets import load_dataset, Audio
|
||||
|
||||
>>> timit = load_dataset("timit_asr")
|
||||
>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
Split this dataset into a train and test set:
|
||||
|
||||
```py
|
||||
>>> timit
|
||||
>>> minds = minds.train_test_split(test_size=0.2)
|
||||
```
|
||||
|
||||
Then take a look at the dataset:
|
||||
|
||||
```py
|
||||
>>> minds
|
||||
DatasetDict({
|
||||
train: Dataset({
|
||||
features: ['file', 'audio', 'text', 'phonetic_detail', 'word_detail', 'dialect_region', 'sentence_type', 'speaker_id', 'id'],
|
||||
num_rows: 4620
|
||||
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
|
||||
num_rows: 450
|
||||
})
|
||||
test: Dataset({
|
||||
features: ['file', 'audio', 'text', 'phonetic_detail', 'word_detail', 'dialect_region', 'sentence_type', 'speaker_id', 'id'],
|
||||
num_rows: 1680
|
||||
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
|
||||
num_rows: 113
|
||||
})
|
||||
})
|
||||
```
|
||||
|
||||
While the dataset contains a lot of helpful information, like `dialect_region` and `sentence_type`, you will focus on the `audio` and `text` fields in this guide. Remove the other columns:
|
||||
While the dataset contains a lot of helpful information, like `lang_id` and `intent_class`, you will focus on the `audio` and `transcription` columns in this guide. Remove the other columns:
|
||||
|
||||
```py
|
||||
>>> timit = timit.remove_columns(
|
||||
... ["phonetic_detail", "word_detail", "dialect_region", "id", "sentence_type", "speaker_id"]
|
||||
... )
|
||||
>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])
|
||||
```
|
||||
|
||||
Take a look at the example again:
|
||||
|
||||
```py
|
||||
>>> timit["train"][0]
|
||||
{'audio': {'array': array([-2.1362305e-04, 6.1035156e-05, 3.0517578e-05, ...,
|
||||
-3.0517578e-05, -9.1552734e-05, -6.1035156e-05], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/404950a46da14eac65eb4e2a8317b1372fb3971d980d91d5d5b221275b1fd7e0/data/TRAIN/DR4/MMDM0/SI681.WAV',
|
||||
'sampling_rate': 16000},
|
||||
'file': '/root/.cache/huggingface/datasets/downloads/extracted/404950a46da14eac65eb4e2a8317b1372fb3971d980d91d5d5b221275b1fd7e0/data/TRAIN/DR4/MMDM0/SI681.WAV',
|
||||
'text': 'Would such an act of refusal be useful?'}
|
||||
>>> minds["train"][0]
|
||||
{'audio': {'array': array([-0.00024414, 0. , 0. , ..., 0.00024414,
|
||||
0.00024414, 0.00024414], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
|
||||
'sampling_rate': 8000},
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
|
||||
'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
|
||||
```
|
||||
|
||||
The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file.
|
||||
@@ -82,6 +86,19 @@ Load the Wav2Vec2 processor to process the audio signal and transcribed text:
|
||||
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
|
||||
```
|
||||
|
||||
The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8000khz. You will need to resample the dataset to use the pretrained Wav2Vec2 model:
|
||||
|
||||
```py
|
||||
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
|
||||
>>> minds["train"][0]
|
||||
{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ...,
|
||||
2.78103951e-04, 2.38446111e-04, 1.18740834e-04], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
|
||||
'sampling_rate': 16000},
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
|
||||
'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
|
||||
```
|
||||
|
||||
The preprocessing function needs to:
|
||||
|
||||
1. Call the `audio` column to load and resample the audio file.
|
||||
@@ -96,14 +113,14 @@ The preprocessing function needs to:
|
||||
... batch["input_length"] = len(batch["input_values"])
|
||||
|
||||
... with processor.as_target_processor():
|
||||
... batch["labels"] = processor(batch["text"]).input_ids
|
||||
... batch["labels"] = processor(batch["transcription"]).input_ids
|
||||
... return batch
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:
|
||||
|
||||
```py
|
||||
>>> timit = timit.map(prepare_dataset, remove_columns=timit.column_names["train"], num_proc=4)
|
||||
>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
|
||||
```
|
||||
|
||||
🤗 Transformers doesn't have a data collator for automatic speech recognition, so you will need to create one. You can adapt the [`DataCollatorWithPadding`] to create a batch of examples for automatic speech recognition. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
|
||||
@@ -165,7 +182,7 @@ Load Wav2Vec2 with [`AutoModelForCTC`]. For `ctc_loss_reduction`, it is often be
|
||||
>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer
|
||||
|
||||
>>> model = AutoModelForCTC.from_pretrained(
|
||||
... "facebook/wav2vec-base",
|
||||
... "facebook/wav2vec2-base",
|
||||
... ctc_loss_reduction="mean",
|
||||
... pad_token_id=processor.tokenizer.pad_token_id,
|
||||
... )
|
||||
@@ -200,8 +217,8 @@ At this point, only three steps remain:
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=timit["train"],
|
||||
... eval_dataset=timit["test"],
|
||||
... train_dataset=encoded_minds["train"],
|
||||
... eval_dataset=encoded_minds["test"],
|
||||
... tokenizer=processor.feature_extractor,
|
||||
... data_collator=data_collator,
|
||||
... )
|
||||
|
||||
@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
Audio classification assigns a label or class to audio data. It is similar to text classification, except an audio input is continuous and must be discretized, whereas text can be split into tokens. Some practical applications of audio classification include identifying intent, speakers, and even animal species by their sounds.
|
||||
|
||||
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the Keyword Spotting subset of the [SUPERB](https://huggingface.co/datasets/superb) benchmark to classify utterances.
|
||||
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) to classify intent.
|
||||
|
||||
<Tip>
|
||||
|
||||
@@ -24,27 +24,59 @@ See the audio classification [task page](https://huggingface.co/tasks/audio-clas
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load SUPERB dataset
|
||||
## Load MInDS-14 dataset
|
||||
|
||||
Load the SUPERB dataset from the 🤗 Datasets library:
|
||||
Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
>>> from datasets import load_dataset, Audio
|
||||
|
||||
>>> ks = load_dataset("superb", "ks")
|
||||
>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
Split this dataset into a train and test set:
|
||||
|
||||
```py
|
||||
>>> ks["train"][0]
|
||||
{'audio': {'array': array([ 0. , 0. , 0. , ..., -0.00592041, -0.00405884, -0.00253296], dtype=float32), 'path': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav', 'sampling_rate': 16000}, 'file': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav', 'label': 10}
|
||||
>>> minds = minds.train_test_split(test_size=0.2)
|
||||
```
|
||||
|
||||
The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file. The `label` column is an integer that represents the utterance class. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label number:
|
||||
Then take a look at the dataset:
|
||||
|
||||
```py
|
||||
>>> labels = ks["train"].features["label"].names
|
||||
>>> minds
|
||||
DatasetDict({
|
||||
train: Dataset({
|
||||
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
|
||||
num_rows: 450
|
||||
})
|
||||
test: Dataset({
|
||||
features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
|
||||
num_rows: 113
|
||||
})
|
||||
})
|
||||
```
|
||||
|
||||
While the dataset contains a lot of other useful information, like `lang_id` and `english_transcription`, you will focus on the `audio` and `intent_class` in this guide. Remove the other columns:
|
||||
|
||||
```py
|
||||
>>> minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])
|
||||
```
|
||||
|
||||
Take a look at an example now:
|
||||
|
||||
```py
|
||||
>>> minds["train"][0]
|
||||
{'audio': {'array': array([ 0. , 0. , 0. , ..., -0.00048828,
|
||||
-0.00024414, -0.00024414], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
|
||||
'sampling_rate': 8000},
|
||||
'intent_class': 2}
|
||||
```
|
||||
|
||||
The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file. The `intent_class` column is an integer that represents the class id of intent. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label number:
|
||||
|
||||
```py
|
||||
>>> labels = minds["train"].features["intent_class"].names
|
||||
>>> label2id, id2label = dict(), dict()
|
||||
>>> for i, label in enumerate(labels):
|
||||
... label2id[label] = str(i)
|
||||
@@ -54,11 +86,11 @@ The `audio` column contains a 1-dimensional `array` of the speech signal that mu
|
||||
Now you can convert the label number to a label name for more information:
|
||||
|
||||
```py
|
||||
>>> id2label[str(10)]
|
||||
'_silence_'
|
||||
>>> id2label[str(2)]
|
||||
'app_error'
|
||||
```
|
||||
|
||||
Each keyword - or label - corresponds to a number; `10` indicates `silence` in the example above.
|
||||
Each keyword - or label - corresponds to a number; `2` indicates `app_error` in the example above.
|
||||
|
||||
## Preprocess
|
||||
|
||||
@@ -70,6 +102,18 @@ Load the Wav2Vec2 feature extractor to process the audio signal:
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
|
||||
```
|
||||
|
||||
The [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset has a sampling rate of 8000khz. You will need to resample the dataset to use the pretrained Wav2Vec2 model:
|
||||
|
||||
```py
|
||||
>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
|
||||
>>> minds["train"][0]
|
||||
{'audio': {'array': array([ 2.2098757e-05, 4.6582241e-05, -2.2803260e-05, ...,
|
||||
-2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
|
||||
'sampling_rate': 16000},
|
||||
'intent_class': 2}
|
||||
```
|
||||
|
||||
The preprocessing function needs to:
|
||||
|
||||
1. Call the `audio` column to load and if necessary resample the audio file.
|
||||
@@ -85,10 +129,11 @@ The preprocessing function needs to:
|
||||
... return inputs
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:
|
||||
|
||||
```py
|
||||
>>> encoded_ks = ks.map(preprocess_function, remove_columns=["audio", "file"], batched=True)
|
||||
>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
|
||||
>>> encoded_minds = encoded_minds.rename_column("intent_class", "label")
|
||||
```
|
||||
|
||||
## Train
|
||||
@@ -130,8 +175,8 @@ At this point, only three steps remain:
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=encoded_ks["train"],
|
||||
... eval_dataset=encoded_ks["validation"],
|
||||
... train_dataset=encoded_minds["train"],
|
||||
... eval_dataset=encoded_minds["test"],
|
||||
... tokenizer=feature_extractor,
|
||||
... )
|
||||
|
||||
|
||||
Reference in New Issue
Block a user