[Speech Examples] Add pytorch speech pretraining (#13877)
* adapt wav2vec2
* add example
* add files
* adapt
* remove bogus file
* Apply suggestions from code review
* adapt files more
* upload changes
* del old files
* up
* up
* up
* up
* up
* correct gradient checkpointing
* add readme
* finish
* finish
* up
* more fixes
* up
* up
* add demo run to readme
* up
committed by GitHub
parent 3499728dc4
commit d45fc7da3d
@@ -3,6 +3,7 @@ scikit-learn
seqeval
psutil
sacrebleu >= 1.4.12
accelerate >= 0.5.0
rouge-score
tensorflow_datasets
matplotlib
124
examples/pytorch/speech-pretraining/README.md
Normal file
@@ -0,0 +1,124 @@
<!---
Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Speech Recognition Pre-Training


## Wav2Vec2 Speech Pre-Training

The script [`run_wav2vec2_pretraining_no_trainer.py`](https://github.com/huggingface/transformers/blob/master/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py) can be used to pre-train a [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html?highlight=wav2vec2) model from scratch.

In the script [`run_wav2vec2_pretraining_no_trainer.py`](https://github.com/huggingface/transformers/blob/master/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py), a Wav2Vec2 model is pre-trained on audio data alone using [Wav2Vec2's contrastive loss objective](https://arxiv.org/abs/2006.11477).
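
As a rough illustration of that objective, the sketch below computes an InfoNCE-style contrastive loss for a single masked time step in plain Python: the context vector is compared, via temperature-scaled cosine similarity, against the true quantized target and a few distractors. The helper names and the temperature of `0.1` are assumptions made for this example. The actual loss is computed inside `Wav2Vec2ForPreTraining` and also includes a diversity term that is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among positive + negatives."""
    candidates = [positive] + list(negatives)
    logits = [cosine_similarity(context, c) / temperature for c in candidates]
    # numerically stable log-sum-exp over all candidates
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[0]


# hypothetical 2-dimensional vectors, chosen only for illustration
context = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# the loss is small when the context is closest to the true target ...
loss_good = contrastive_loss(context, positive, negatives)
# ... and large when a distractor is treated as the target instead
loss_bad = contrastive_loss(context, negatives[0], [positive, negatives[1]])
print(loss_good < loss_bad)
```

The loss shrinks as the context representation moves closer to the true target than to any distractor, which is the behaviour the pre-training objective rewards.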

The following examples show how to pre-train a `"base"`-sized Wav2Vec2 model as well as a `"large"`-sized Wav2Vec2 model using [`accelerate`](https://github.com/huggingface/accelerate).


---
**NOTE 1**

Wav2Vec2's pre-training is known to be quite unstable.
It is advised to do a couple of test runs with a smaller dataset,
*e.g.* `--dataset_config_names clean clean`, `--dataset_split_names validation test`,
to find good hyper-parameters for `learning_rate`, `batch_size`, `num_warmup_steps`,
and the optimizer.
A good metric to observe during training is the gradient norm, which should ideally be between 0.5 and 2.

---
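
To make the gradient-norm check concrete, here is a minimal sketch in plain Python, in the spirit of the script's `get_grad_norm` helper. It computes the global L2 norm over all gradient values and tests it against the 0.5 to 2 range mentioned above. Gradients are represented as flat lists of floats, and `global_grad_norm` and `in_target_range` are hypothetical names used only for this illustration.

```python
import math


def global_grad_norm(grads, scale=1.0):
    """L2 norm over all gradient values, after dividing by a loss scale."""
    total = 0.0
    for grad in grads:  # one flat list of floats per parameter
        total += sum((g / scale) ** 2 for g in grad)
    return math.sqrt(total)


def in_target_range(norm, low=0.5, high=2.0):
    """True if the norm lies in the range suggested in NOTE 1."""
    return low <= norm <= high


# hypothetical gradients for two parameters
grads = [[0.3, -0.4], [1.2]]
norm = global_grad_norm(grads)
print(norm, in_target_range(norm))
```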

---
**NOTE 2**

When training a model on large datasets, it is recommended to first run the data preprocessing
in **non-distributed** mode via `--preprocessing_only`, so that
when the script is run in **distributed** mode in a second step, the preprocessed data
can easily be loaded from the cache on each distributed device.

---

### Demo

In this demo run we pre-train a `"base-sized"` Wav2Vec2 model only on the validation
and test data of [librispeech_asr](https://huggingface.co/datasets/librispeech_asr).

The demo is run on two Titan RTX GPUs (24 GB of memory each). In case you have less memory available
per device, consider reducing `--per_device_train_batch_size` and/or `--max_duration_in_seconds`.
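
When lowering the per-device batch size, it can help to keep the effective batch size constant by raising the number of gradient accumulation steps. The sketch below mirrors how the script computes its total train batch size (per-device batch size × number of processes × accumulation steps). The function name is a hypothetical helper, and the second configuration is only an illustrative alternative, not a tested setting.

```python
def effective_batch_size(per_device_batch_size, num_devices, gradient_accumulation_steps):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * num_devices * gradient_accumulation_steps


# values of the demo command: batch size 8 per device, 2 GPUs, 8 accumulation steps
demo = effective_batch_size(8, 2, 8)

# illustrative alternative for smaller GPUs: halve the per-device batch size
# and double the accumulation steps to keep the same effective batch size
smaller_gpu = effective_batch_size(4, 2, 16)

print(demo, smaller_gpu)
```

Trading per-device batch size for accumulation steps keeps the number of samples per update the same at the cost of more forward and backward passes per update.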


```bash
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name="librispeech_asr" \
    --dataset_config_names clean clean \
    --dataset_split_names validation test \
    --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
    --output_dir="./wav2vec2-pretrained-demo" \
    --max_train_steps="20000" \
    --num_warmup_steps="32000" \
    --gradient_accumulation_steps="8" \
    --learning_rate="0.005" \
    --weight_decay="0.01" \
    --max_duration_in_seconds="20.0" \
    --min_duration_in_seconds="2.0" \
    --logging_steps="1" \
    --saving_steps="10000" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="8" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --adam_epsilon="1e-06" \
    --gradient_checkpointing
```

The results of this run can be seen [here](https://wandb.ai/patrickvonplaten/wav2vec2-pretrained-demo/reports/Wav2Vec2-PreTraining-Demo-Run--VmlldzoxMDk3MjAw?accessToken=oa05s1y57lizo2ocxy3k01g6db1u4pt8m6ur2n8nl4cb0ug02ms2cw313kb8ruch).
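
When tuning `num_warmup_steps`, it is worth checking how it interacts with `max_train_steps`. The sketch below assumes that the script's default `linear` schedule has the usual shape: a linear ramp to the peak learning rate over `num_warmup_steps`, followed by a linear decay to zero at `max_train_steps`. `linear_warmup_lr` is a hypothetical helper written for this illustration and is not part of the script.

```python
def linear_warmup_lr(step, peak_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # warmup phase: ramp linearly from 0 to peak_lr
        return peak_lr * step / max(1, num_warmup_steps)
    # decay phase: fall linearly from peak_lr to 0
    remaining = max(0, num_training_steps - step)
    return peak_lr * remaining / max(1, num_training_steps - num_warmup_steps)


# demo settings: peak 0.005, 32,000 warmup steps, 20,000 training steps
lr_at_end_of_demo = linear_warmup_lr(20_000, 0.005, 32_000, 20_000)

# a run with 200,000 steps and the same warmup reaches its peak of 0.001
lr_at_peak = linear_warmup_lr(32_000, 0.001, 32_000, 200_000)

print(lr_at_end_of_demo, lr_at_peak)
```

Under that assumption, the demo run finishes while it is still warming up, so its learning rate would top out at roughly 0.0031 rather than the nominal 0.005.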

### Base

TODO (currently running...)


### Large

To pre-train a `"large-sized"` Wav2Vec2 model, *e.g.* [facebook/wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60),
on [librispeech_asr](https://huggingface.co/datasets/librispeech_asr), the following command can be run:

```bash
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name=librispeech_asr \
    --dataset_config_names clean clean other \
    --dataset_split_names train.100 train.360 train.500 \
    --output_dir=./test \
    --max_train_steps=200000 \
    --num_warmup_steps=32000 \
    --gradient_accumulation_steps=8 \
    --learning_rate=0.001 \
    --weight_decay=0.01 \
    --max_duration_in_seconds=20.0 \
    --min_duration_in_seconds=2.0 \
    --model_name_or_path=./ \
    --logging_steps=1 \
    --saving_steps=10000 \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=4 \
    --adam_beta1=0.9 \
    --adam_beta2=0.98 \
    --adam_epsilon=1e-06 \
    --gradient_checkpointing
```

The experiment was run on 8 V100 GPUs (16 GB of memory each) for 7 days.
In case you have more than 8 GPUs available for a higher effective `batch_size`,
it is recommended to increase the `learning_rate` to `0.005` for faster convergence.

The results of this run can be seen [here](https://wandb.ai/patrickvonplaten/pretraining-wav2vec2/reports/Wav2Vec2-Large--VmlldzoxMTAwODM4?accessToken=wm3qzcnldrwsa31tkvf2pdmilw3f63d4twtffs86ou016xjbyilh55uoi3mo1qzc) and the checkpoint pre-trained for 120,000 steps can be accessed [here](https://huggingface.co/patrickvonplaten/wav2vec2-large-repro-960h-libri-120k-steps).
4
examples/pytorch/speech-pretraining/requirements.txt
Normal file
@@ -0,0 +1,4 @@
datasets >= 1.12.0
torch >= 1.5
torchaudio
accelerate >= 0.5.0
700
examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py
Executable file
@@ -0,0 +1,700 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

""" Pre-Training a 🤗 Wav2Vec2 model on unlabeled audio data """

import argparse
import logging
import math
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional, Union

import datasets
import torch
import torchaudio
from datasets import DatasetDict, concatenate_datasets, load_dataset
from torch.utils.data.dataloader import DataLoader
from tqdm.auto import tqdm

import transformers
from accelerate import Accelerator
from huggingface_hub import Repository
from transformers import (
    AdamW,
    SchedulerType,
    Wav2Vec2Config,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForPreTraining,
    get_scheduler,
    is_wandb_available,
    set_seed,
)
from transformers.file_utils import get_full_repo_name
from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices, _sample_negative_indices


logger = logging.getLogger(__name__)


def parse_args():
    parser = argparse.ArgumentParser(description="Pre-train a Wav2Vec2 model on unlabeled audio data")
    parser.add_argument(
        "--dataset_name",
        type=str,
        default=None,
        help="The name of the dataset to use (via the datasets library).",
    )
    parser.add_argument(
        "--dataset_config_names",
        nargs="+",
        type=str,
        required=True,
        help="The configuration names of the dataset to use (via the datasets library).",
    )
    parser.add_argument(
        "--dataset_split_names",
        nargs="+",
        type=str,
        required=True,
        help="The names of the training data set splits to use (via the datasets library).",
    )
    parser.add_argument(
        "--preprocessing_num_workers",
        type=int,
        default=None,
        help="The number of processes to use for the preprocessing.",
    )
    parser.add_argument(
        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
    )
    parser.add_argument(
        "--preprocessing_only",
        action="store_true",
        help="Only run the preprocessing script to be cached for future use",
    )
    parser.add_argument(
        "--cache_dir",
        type=str,
        default=None,
        help="Where to store the datasets downloaded via the datasets library",
    )
    parser.add_argument(
        "--validation_split_percentage",
        type=int,
        default=1,
        help="Percentage of the training data that is held out and used for validation.",
    )
    parser.add_argument(
        "--logging_steps",
        type=int,
        default=500,
        help="Number of steps between each logging",
    )
    parser.add_argument(
        "--saving_steps",
        type=int,
        default=500,
        help="Number of steps between each model save",
    )
    parser.add_argument(
        "--audio_column_name",
        type=str,
        default="file",
        help="Column in the dataset that contains speech file path. Defaults to 'file'",
    )
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
        required=True,
    )
    parser.add_argument(
        "--config_name",
        type=str,
        default=None,
        help="Pretrained config name or path if not the same as model_name",
    )
    parser.add_argument(
        "--per_device_train_batch_size",
        type=int,
        default=8,
        help="Batch size (per device) for the training dataloader.",
    )
    parser.add_argument(
        "--per_device_eval_batch_size",
        type=int,
        default=8,
        help="Batch size (per device) for the evaluation dataloader.",
    )
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=5e-5,
        help="Initial learning rate (after the potential warmup period) to use.",
    )
    parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.")
    parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.")
    parser.add_argument(
        "--max_train_steps",
        type=int,
        default=None,
        help="Total number of training steps to perform. If provided, overrides num_train_epochs.",
    )
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of update steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument(
        "--gradient_checkpointing",
        action="store_true",
        help="If True, use gradient checkpointing to save memory at the expense of slower backward pass.",
    )
    parser.add_argument(
        "--lr_scheduler_type",
        type=SchedulerType,
        default="linear",
        help="The scheduler type to use.",
        choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"],
    )
    parser.add_argument(
        "--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler."
    )
    parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.")
    parser.add_argument("--seed", type=int, default=0, help="A seed for reproducible training.")
    parser.add_argument(
        "--max_gumbel_temperature",
        type=float,
        default=2.0,
        help="Maximum temperature for gumbel softmax.",
    )
    parser.add_argument(
        "--min_gumbel_temperature",
        type=float,
        default=0.5,
        help="Minimum temperature for gumbel softmax.",
    )
    parser.add_argument(
        "--gumbel_temperature_decay", type=float, default=0.999995, help="Decay of gumbel temperature during training."
    )
    parser.add_argument(
        "--max_duration_in_seconds",
        type=float,
        default=5.0,
        help="Truncate audio files that are longer than `max_duration_in_seconds` seconds",
    )
    parser.add_argument(
        "--min_duration_in_seconds",
        type=float,
        default=3.0,
        help="Filter out audio files that are shorter than `min_duration_in_seconds` seconds",
    )
    parser.add_argument(
        "--pad_to_multiple_of",
        type=int,
        default=None,
        help="If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).",
    )
    parser.add_argument(
        "--adam_beta1",
        type=float,
        default=0.9,
        help="Beta1 for AdamW optimizer",
    )
    parser.add_argument(
        "--adam_beta2",
        type=float,
        default=0.999,
        help="Beta2 for AdamW optimizer",
    )
    parser.add_argument(
        "--adam_epsilon",
        type=float,
        default=1e-8,
        help="Epsilon for AdamW optimizer",
    )
    parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.")
    parser.add_argument(
        "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`."
    )
    parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.")
    args = parser.parse_args()

    if args.push_to_hub:
        assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed."

    if args.output_dir is not None:
        os.makedirs(args.output_dir, exist_ok=True)

    return args


@dataclass
class DataCollatorForWav2Vec2Pretraining:
    """
    Data collator that will dynamically pad the inputs received and prepare masked indices
    for self-supervised pretraining.

    Args:
        model (:class:`~transformers.Wav2Vec2ForPreTraining`):
            The Wav2Vec2 model used for pretraining. The data collator needs to have access
            to config and ``_get_feat_extract_output_lengths`` function for correct padding.
        feature_extractor (:class:`~transformers.Wav2Vec2FeatureExtractor`):
            The feature extractor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`'longest'`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'` (default): Pad to the longest sequence in the batch (or no padding if
              only a single sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'`: No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
    """

    model: Wav2Vec2ForPreTraining
    feature_extractor: Wav2Vec2FeatureExtractor
    padding: Union[bool, str] = "longest"
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # reformat list to dict and set to pytorch format
        batch = self.feature_extractor.pad(
            features,
            padding=self.padding,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        device = batch["input_values"].device
        batch_size = batch["input_values"].shape[0]

        mask_indices_seq_length = self.model._get_feat_extract_output_lengths(batch["input_values"].shape[-1])

        # make sure that no loss is computed on padded inputs
        if batch.get("attention_mask") is not None:
            # compute real output lengths according to convolution formula
            batch["sub_attention_mask"] = self.model._get_feature_vector_attention_mask(
                mask_indices_seq_length, batch["attention_mask"]
            )

        features_shape = (batch_size, mask_indices_seq_length)

        # sample randomly masked indices
        mask_time_indices = _compute_mask_indices(
            features_shape,
            self.model.config.mask_time_prob,
            self.model.config.mask_time_length,
            attention_mask=batch.get("sub_attention_mask"),
        )

        # sample negative indices
        sampled_negative_indices = _sample_negative_indices(
            features_shape,
            self.model.config.num_negatives,
            mask_time_indices=mask_time_indices,
        )
        batch["mask_time_indices"] = torch.tensor(mask_time_indices, dtype=torch.long, device=device)
        batch["sampled_negative_indices"] = torch.tensor(sampled_negative_indices, dtype=torch.long, device=device)

        return batch


def multiply_grads(params, c):
    """Multiplies grads by a constant *c*."""
    for p in params:
        if p.grad is not None:
            if torch.is_tensor(c):
                c = c.to(p.grad.device)
            p.grad.data.mul_(c)


def get_grad_norm(params, scale=1):
    """Compute grad norm given a gradient scale."""
    total_norm = 0.0
    for p in params:
        if p.grad is not None:
            param_norm = (p.grad.detach().data / scale).norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5
    return total_norm


def main():
    # See all possible arguments by passing the --help flag to this script.
    args = parse_args()

    # Initialize the accelerator. We will let the accelerator handle device placement for us in this example.
    accelerator = Accelerator()
    logger.info(accelerator.state)

    # Setup logging, we only want one process per machine to log things on the screen.
    # accelerator.is_local_main_process is only True for one process per machine.
    logger.setLevel(logging.INFO if accelerator.is_local_main_process else logging.ERROR)
    if accelerator.is_local_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()

        # set up weights and biases if available
        if is_wandb_available():
            import wandb

            wandb.init(project=args.output_dir.split("/")[-1])
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    # If passed along, set the training seed now.
    if args.seed is not None:
        set_seed(args.seed)

    # Handle the repository creation
    if accelerator.is_main_process:
        if args.push_to_hub and not args.preprocessing_only:
            if args.hub_model_id is None:
                repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token)
            else:
                repo_name = args.hub_model_id
            repo = Repository(args.output_dir, clone_from=repo_name)
        elif args.output_dir is not None:
            os.makedirs(args.output_dir, exist_ok=True)
    accelerator.wait_for_everyone()

    # 1. Download and create train, validation dataset
    # We load all dataset configuration and dataset split pairs passed in
    # ``args.dataset_config_names`` and ``args.dataset_split_names``
    datasets_splits = []
    for dataset_config_name, train_split_name in zip(args.dataset_config_names, args.dataset_split_names):
        # load dataset
        dataset_split = load_dataset(
            args.dataset_name, dataset_config_name, split=train_split_name, cache_dir=args.cache_dir
        )
        datasets_splits.append(dataset_split)

    # Next, we concatenate all configurations and splits into a single training dataset
    raw_datasets = DatasetDict()
    if len(datasets_splits) > 1:
        raw_datasets["train"] = concatenate_datasets(datasets_splits).shuffle(seed=args.seed)
    else:
        raw_datasets["train"] = datasets_splits[0]

    # Take ``args.validation_split_percentage`` percent of the training dataset as the validation set
    num_validation_samples = raw_datasets["train"].num_rows * args.validation_split_percentage // 100

    if num_validation_samples == 0:
        raise ValueError(
            "`args.validation_split_percentage` is less than a single sample "
            f"for {len(raw_datasets['train'])} training samples. Increase "
            "`args.validation_split_percentage`. "
        )

    raw_datasets["validation"] = raw_datasets["train"].select(range(num_validation_samples))
    raw_datasets["train"] = raw_datasets["train"].select(range(num_validation_samples, raw_datasets["train"].num_rows))

    # 2. Preprocess audio: load, resample, normalize and truncate
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(args.model_name_or_path)

    # only normalized-inputs-training is supported
    if not feature_extractor.do_normalize:
        raise ValueError(
            "Training is only supported for normalized inputs. Make sure ``feature_extractor.do_normalize == True``"
        )

    # set max & min audio length in number of samples
    max_length = int(args.max_duration_in_seconds * feature_extractor.sampling_rate)
    min_length = int(args.min_duration_in_seconds * feature_extractor.sampling_rate)

    resampler = None
    if raw_datasets["train"][args.audio_column_name][0].split(".")[-1] == "mp3":
        # TODO(PVP) - remove hard-coded 48_000 after audio feature is merged
        resampler = torchaudio.transforms.Resample(48_000, feature_extractor.sampling_rate)

    def prepare_dataset(batch):
        speech_array, sampling_rate = torchaudio.load(batch[args.audio_column_name])
        speech_array = speech_array.squeeze()

        # if necessary resample audio
        if resampler is not None:
            # TODO(PVP) - remove hard-coded 48_000 after audio feature is merged
            speech_array = resampler(speech_array)
            sampling_rate = resampler.new_freq

        speech_array = speech_array.numpy()
        inputs = feature_extractor(speech_array, sampling_rate=sampling_rate, max_length=max_length, truncation=True)
        batch["input_values"] = inputs.input_values[0]
        return batch

    # load audio files into numpy arrays
    with accelerator.main_process_first():
        vectorized_datasets = raw_datasets.map(
            prepare_dataset,
            num_proc=args.preprocessing_num_workers,
            remove_columns=raw_datasets["train"].column_names,
            load_from_cache_file=not args.overwrite_cache,
        )
        vectorized_datasets = vectorized_datasets.filter(
            lambda x: len(x["input_values"]) > min_length, load_from_cache_file=not args.overwrite_cache
        )

    # for large datasets it is advised to run the preprocessing on a
    # single machine first with ``args.preprocessing_only`` since there will most likely
    # be a timeout when running the script in distributed mode.
    # In a second step ``args.preprocessing_only`` can then be set to `False` to load the
    # cached dataset
    if args.preprocessing_only:
        return

    # 3. Load model
    config = Wav2Vec2Config.from_pretrained(args.model_name_or_path)

    # pretraining is only supported for "newer" stable layer norm architecture
    # apply_spec_augment has to be True, mask_feature_prob has to be 0.0
    if not config.do_stable_layer_norm or config.feat_extract_norm != "layer":
        raise ValueError(
            "PreTraining is only supported for ``config.do_stable_layer_norm=True`` and ``config.feat_extract_norm='layer'``"
        )

    # initialize random model
    model = Wav2Vec2ForPreTraining(config)

    # Activate gradient checkpointing if needed
    if args.gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # 4. Define data collator, optimizer and scheduler
    data_collator = DataCollatorForWav2Vec2Pretraining(
        model=model, feature_extractor=feature_extractor, pad_to_multiple_of=args.pad_to_multiple_of
    )
    train_dataloader = DataLoader(
        vectorized_datasets["train"],
        shuffle=True,
        collate_fn=data_collator,
        batch_size=args.per_device_train_batch_size,
    )
    eval_dataloader = DataLoader(
        vectorized_datasets["validation"], collate_fn=data_collator, batch_size=args.per_device_eval_batch_size
    )

    # Optimizer
    optimizer = AdamW(
        list(model.parameters()),
        lr=args.learning_rate,
        betas=[args.adam_beta1, args.adam_beta2],
        eps=args.adam_epsilon,
    )

    # Prepare everything with our `accelerator`.
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    # Scheduler and math around the number of training steps.
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)

    if args.max_train_steps is None:
        args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    else:
        args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

    lr_scheduler = get_scheduler(
        name=args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps,
        num_training_steps=args.max_train_steps,
    )

# 5. Train
|
||||
total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
|
||||
|
||||
logger.info("***** Running training *****")
|
||||
logger.info(f" Num examples = {len(vectorized_datasets['train'])}")
|
||||
logger.info(f" Num Epochs = {args.num_train_epochs}")
|
||||
logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size}")
|
||||
logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
|
||||
logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}")
|
||||
logger.info(f" Total optimization steps = {args.max_train_steps}")
|
||||
completed_steps = 0
|
||||
|
||||
# Only show the progress bar once on each machine.
|
||||
progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
|
||||
completed_steps = 0
|
||||
for epoch in range(args.num_train_epochs):
|
||||
model.train()
|
||||
for step, batch in enumerate(train_dataloader):
|
||||
# compute num of losses
|
||||
num_losses = batch["mask_time_indices"].sum()
|
||||
sub_attention_mask = batch.pop("sub_attention_mask", None)
|
||||
sub_attention_mask = (
|
||||
sub_attention_mask if sub_attention_mask is not None else torch.ones_like(batch["mask_time_indices"])
|
||||
)
|
||||
percent_masked = num_losses / sub_attention_mask.sum()
|
||||
|
||||
# forward
|
||||
outputs = model(**batch)
|
||||
|
||||
# divide loss by gradient accumulation steps since gradients
|
||||
# are accumulated for multiple backward passes in PyTorch
|
||||
loss = outputs.loss / args.gradient_accumulation_steps
|
||||
accelerator.backward(loss)
|
||||
|
||||
# make sure that `num_losses` is summed for distributed training
|
||||
# and average gradients over losses of all devices
|
||||
if accelerator.state.num_processes > 1:
|
||||
num_losses = accelerator.gather(num_losses).sum()
|
||||
gradient_multiplier = accelerator.state.num_processes / num_losses
|
||||
multiply_grads(model.module.parameters(), gradient_multiplier)
|
||||
else:
|
||||
multiply_grads(model.parameters(), 1 / num_losses)
|
||||
|
||||
# update step
|
||||
if (step + 1) % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
|
||||
|
||||
# compute grad norm for monitoring
|
||||
scale = (
|
||||
accelerator.scaler._scale.item()
|
||||
if hasattr(accelerator, "scaler") and accelerator.scaler is not None
|
||||
else 1
|
||||
)
|
||||
if accelerator.state.num_processes > 1:
|
||||
grad_norm = get_grad_norm(model.module.parameters(), scale)
|
||||
else:
|
||||
grad_norm = get_grad_norm(model.parameters(), scale)
|
||||
|
||||
# update parameters
|
||||
optimizer.step()
|
||||
optimizer.zero_grad()
|
||||
|
||||
if not accelerator.optimizer_step_was_skipped:
|
||||
lr_scheduler.step()
|
||||
elif accelerator.is_local_main_process:
|
||||
progress_bar.write(
|
||||
"Gradients have overflown - skipping update step... " f"Updating gradient scale to {scale}..."
|
||||
)
|
||||
|
||||
# update gumbel temperature
|
||||
gumbel_temperature = max(
|
||||
args.max_gumbel_temperature * args.gumbel_temperature_decay ** completed_steps,
|
||||
args.min_gumbel_temperature,
|
||||
)
|
||||
if hasattr(model, "module"):
|
||||
model.module.set_gumbel_temperature(gumbel_temperature)
|
||||
else:
|
||||
model.set_gumbel_temperature(gumbel_temperature)
|
||||
|
||||
progress_bar.update(1)
|
||||
completed_steps += 1
|
||||
|
||||
            # 6. Log all results
            if (step + 1) % (args.gradient_accumulation_steps * args.logging_steps) == 0:
                # `detach()` is not in-place, so the results have to be reassigned
                loss = loss.detach()
                outputs.contrastive_loss = outputs.contrastive_loss.detach()
                outputs.diversity_loss = outputs.diversity_loss.detach()

                if accelerator.state.num_processes > 1:
                    loss = accelerator.gather(loss).sum()
                    outputs.contrastive_loss = accelerator.gather(outputs.contrastive_loss).sum()
                    outputs.diversity_loss = accelerator.gather(outputs.diversity_loss).sum()
                    percent_masked = accelerator.gather(percent_masked).sum()

                train_logs = {
                    "loss": (loss * args.gradient_accumulation_steps) / num_losses,
                    "contrast_loss": outputs.contrastive_loss / num_losses,
                    "div_loss": outputs.diversity_loss / num_losses,
                    "%_mask_idx": percent_masked / accelerator.num_processes,
                    "ppl": outputs.codevector_perplexity,
                    "lr": torch.tensor(optimizer.param_groups[0]["lr"]),
                    "temp": torch.tensor(gumbel_temperature),
                    "grad_norm": torch.tensor(grad_norm),
                }
                log_str = ""
                for k, v in train_logs.items():
                    log_str += "| {}: {:.3e}".format(k, v.item())

                if accelerator.is_local_main_process:
                    progress_bar.write(log_str)
                    if is_wandb_available():
                        wandb.log(train_logs)

            # save model every `args.saving_steps` steps
            if (step + 1) % (args.gradient_accumulation_steps * args.saving_steps) == 0:
                if (args.push_to_hub and epoch < args.num_train_epochs - 1) or args.output_dir is not None:
                    accelerator.wait_for_everyone()
                    unwrapped_model = accelerator.unwrap_model(model)
                    unwrapped_model.save_pretrained(args.output_dir, save_function=accelerator.save)

                if (args.push_to_hub and epoch < args.num_train_epochs - 1) and accelerator.is_main_process:
                    repo.push_to_hub(commit_message=f"Training in progress step {completed_steps}", blocking=False)

            # stop once `completed_steps` reaches `args.max_train_steps`
            if completed_steps >= args.max_train_steps:
                break

        # 7. Validate!
        model.eval()

        # init logs
        val_logs = {
            "val_loss": 0,
            "val_contrastive_loss": 0,
            "val_diversity_loss": 0,
            "val_num_losses": 0,
        }
        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                batch.pop("sub_attention_mask", None)
                outputs = model(**batch)

            val_logs["val_loss"] += outputs.loss
            val_logs["val_contrastive_loss"] += outputs.contrastive_loss
            val_logs["val_diversity_loss"] += outputs.diversity_loss
            val_logs["val_num_losses"] += batch["mask_time_indices"].sum()

        # sum over devices in multi-processing
        if accelerator.num_processes > 1:
            val_logs = {k: accelerator.gather(v).sum() for k, v in val_logs.items()}

        val_logs = {k: v / val_logs["val_num_losses"] for k, v in val_logs.items()}

        log_str = ""
        for k, v in val_logs.items():
            log_str += "| {}: {:.3e}".format(k, v.item())

        if accelerator.is_local_main_process:
            progress_bar.write(log_str)
            if is_wandb_available():
                wandb.log(val_logs)

        if args.output_dir is not None:
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(args.output_dir, save_function=accelerator.save)
            if accelerator.is_main_process:
                if args.push_to_hub:
                    repo.push_to_hub(commit_message="End of training")


if __name__ == "__main__":
    main()
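
Two pieces of arithmetic in the training loop above are easy to misread, so here is a minimal standalone sketch of both. It assumes that the model returns a loss summed over masked positions and that the distributed backend averages gradients across processes, which is what the `num_processes / num_losses` multiplier in the loop suggests. The function names and all numeric values are hypothetical and chosen only for illustration; they are not taken from the script.

```python
def gradient_multiplier(num_processes, losses_per_process):
    """Factor that turns a mean over processes into a mean over all losses.

    Assumption: gradients were already averaged over `num_processes`, and each
    process back-propagated a loss that was summed over its masked positions.
    """
    total_losses = sum(losses_per_process)
    return num_processes / total_losses


def gumbel_temperature(completed_steps, max_temp, min_temp, decay):
    """Exponentially decayed temperature that never drops below `min_temp`."""
    return max(max_temp * decay**completed_steps, min_temp)


if __name__ == "__main__":
    # Two hypothetical processes: summed-loss gradients 6.0 and 2.0,
    # computed over 3 and 1 masked positions respectively.
    grads = [6.0, 2.0]
    counts = [3, 1]

    # What averaging across processes would leave in each gradient buffer.
    averaged = sum(grads) / len(grads)

    # Rescaling recovers the per-loss average (6.0 + 2.0) / (3 + 1).
    normalized = averaged * gradient_multiplier(len(grads), counts)
    print("normalized gradient:", normalized)

    # Illustrative schedule values only.
    for step in (0, 1_000, 100_000, 10_000_000):
        print(step, gumbel_temperature(step, max_temp=2.0, min_temp=0.5, decay=0.999995))
```

In the single-process branch no averaging has happened, so the same reasoning reduces to multiplying by `1 / num_losses`.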
@@ -23,6 +23,7 @@ from unittest.mock import patch

import torch

from transformers import Wav2Vec2ForPreTraining
from transformers.file_utils import is_apex_available
from transformers.testing_utils import TestCasePlus, get_gpu_count, slow, torch_device

@@ -41,6 +42,7 @@ SRC_DIRS = [
        "image-classification",
        "speech-recognition",
        "audio-classification",
        "speech-pretraining",
    ]
]
sys.path.extend(SRC_DIRS)
@@ -59,6 +61,7 @@ if SRC_DIRS is not None:
    import run_summarization
    import run_swag
    import run_translation
    import run_wav2vec2_pretraining_no_trainer


logging.basicConfig(level=logging.DEBUG)
@@ -447,3 +450,32 @@ class ExamplesTests(TestCasePlus):
            run_audio_classification.main()
            result = get_results(tmp_dir)
            self.assertLess(result["eval_loss"], result["train_loss"])

    def test_run_wav2vec2_pretraining(self):
        stream_handler = logging.StreamHandler(sys.stdout)
        logger.addHandler(stream_handler)

        tmp_dir = self.get_auto_remove_tmp_dir()
        testargs = f"""
            run_wav2vec2_pretraining_no_trainer.py
            --output_dir {tmp_dir}
            --model_name_or_path hf-internal-testing/tiny-random-wav2vec2
            --dataset_name patrickvonplaten/librispeech_asr_dummy
            --dataset_config_names clean
            --dataset_split_names validation
            --learning_rate 1e-4
            --per_device_train_batch_size 2
            --per_device_eval_batch_size 2
            --preprocessing_num_workers 16
            --max_train_steps 5
            --validation_split_percentage 5
            --seed 42
        """.split()

        if is_cuda_and_apex_available():
            testargs.append("--fp16")

        with patch.object(sys, "argv", testargs):
            run_wav2vec2_pretraining_no_trainer.main()
            model = Wav2Vec2ForPreTraining.from_pretrained(tmp_dir)
            self.assertIsNotNone(model)
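
The test above drives the example script by temporarily replacing `sys.argv` and then calling the script's `main()`. A minimal sketch of that pattern follows, using only the standard library. The toy `main()` and the `run_with_args` helper are hypothetical stand-ins, not part of the test suite; the real script parses many more arguments.

```python
import argparse
import sys
from unittest.mock import patch


def main():
    # Hypothetical stand-in for an example script's `main()`: it reads its
    # configuration from `sys.argv`, just like a script run from the shell.
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--max_train_steps", type=int, default=10)
    parser.add_argument("--fp16", action="store_true")
    return parser.parse_args()


def run_with_args(cli):
    """Call `main()` as if `cli` had been typed on the command line."""
    testargs = cli.split()
    # `patch.object` swaps `sys.argv` for the duration of the block and
    # restores the original value afterwards, even if `main()` raises.
    with patch.object(sys, "argv", testargs):
        return main()


if __name__ == "__main__":
    args = run_with_args("toy_script.py --output_dir out --max_train_steps 5")
    print(args.output_dir, args.max_train_steps, args.fp16)
```

Because the first element of `sys.argv` is treated as the program name, the argument string starts with a script name, just as `testargs` does in the test.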