[Docs] Fix spelling and grammar mistakes (#28825)
* Fix typos and grammar mistakes in docs and examples
* Fix typos in docstrings and comments
* Fix spelling of `tokenizer` in model tests
* Remove erroneous spaces in decorators
* Remove extra spaces in Markdown link texts
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.

 ## Overview

-The Informer model was proposed in [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting ](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
+The Informer model was proposed in [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.

 This method introduces a Probabilistic Attention mechanism to select the "active" queries rather than the "lazy" queries and provides a sparse Transformer thus mitigating the quadratic compute and memory requirements of vanilla attention.

@@ -27,7 +27,7 @@ The abstract from the paper is the following:
 *We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multiscale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples, along with model weights and code.*

 As shown on the following figure, Jukebox is made of 3 `priors` which are decoder only models. They follow the architecture described in [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509), modified to support longer context length.
-First, a autoencoder is used to encode the text lyrics. Next, the first (also called `top_prior`) prior attends to the last hidden states extracted from the lyrics encoder. The priors are linked to the previous priors respectively via an `AudioConditionner` module. The`AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution.
+First, a autoencoder is used to encode the text lyrics. Next, the first (also called `top_prior`) prior attends to the last hidden states extracted from the lyrics encoder. The priors are linked to the previous priors respectively via an `AudioConditioner` module. The`AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution.
 The metadata such as *artist, genre and timing* are passed to each prior, in the form of a start token and positional embedding for the timing data. The hidden states are mapped to the closest codebook vector from the VQVAE in order to convert them to raw audio.

@@ -37,7 +37,7 @@ The original code can be found [here](https://github.com/openai/jukebox).

 ## Usage tips

-- This model only supports inference. This is for a few reasons, mostly because it requires a crazy amount of memory to train. Feel free to open a PR and add what's missing to have a full integration with the hugging face traineer!
+- This model only supports inference. This is for a few reasons, mostly because it requires a crazy amount of memory to train. Feel free to open a PR and add what's missing to have a full integration with the hugging face trainer!
 - This model is very slow, and takes 8h to generate a minute long audio using the 5b top prior on a V100 GPU. In order automaticallay handle the device on which the model should execute, use `accelerate`.
 - Contrary to the paper, the order of the priors goes from `0` to `1` as it felt more intuitive : we sample starting from `0`.
 - Primed sampling (conditioning the sampling on raw audio) requires more memory than ancestral sampling and should be used with `fp16` set to `True`.

@@ -27,7 +27,7 @@ The model can be used for tasks like question answering on web pages or informat
 state-of-the-art results on 2 important benchmarks:
 - [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages)
 - [SWDE](https://www.researchgate.net/publication/221299838_From_one_tree_to_a_forest_a_unified_solution_for_structured_web_data_extraction), a dataset
-for information extraction from web pages (basically named-entity recogntion on web pages)
+for information extraction from web pages (basically named-entity recognition on web pages)

 The abstract from the paper is the following:

@@ -39,7 +39,7 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The

 ## Usage tips

-- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxilary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
+- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxiliary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
 - If you want to train the model in a distributed environment across multiple nodes, then one should update the
 `get_num_masks` function inside in the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be
 set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169).

@@ -40,7 +40,7 @@ The original code can be found [here](https://github.com/mistralai/mistral-src).

 Mixtral-45B is a decoder-based LM with the following architectural choices:

-* Mixtral is a Mixture of Expert (MOE) model with 8 experts per MLP, with a total of 45B paramateres but the compute required is the same as a 14B model. This is because even though each experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dipatched twice (top 2 routing) and thus the compute (the operation required at each foward computation) is just 2 X sequence_length.
+* Mixtral is a Mixture of Expert (MOE) model with 8 experts per MLP, with a total of 45B paramateres but the compute required is the same as a 14B model. This is because even though each experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length.

 The following implementation details are shared with Mistral AI's first model [mistral](mistral):
 * Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens

@@ -283,7 +283,7 @@ waveform = outputs.waveform[0]

 **Tips:**

-* The MMS-TTS checkpoints are trained on lower-cased, un-punctuated text. By default, the `VitsTokenizer` *normalizes* the inputs by removing any casing and punctuation, to avoid passing out-of-vocabulary characters to the model. Hence, the model is agnostic to casing and punctuation, so these should be avoided in the text prompt. You can disable normalisation by setting `noramlize=False` in the call to the tokenizer, but this will lead to un-expected behaviour and is discouraged.
+* The MMS-TTS checkpoints are trained on lower-cased, un-punctuated text. By default, the `VitsTokenizer` *normalizes* the inputs by removing any casing and punctuation, to avoid passing out-of-vocabulary characters to the model. Hence, the model is agnostic to casing and punctuation, so these should be avoided in the text prompt. You can disable normalisation by setting `normalize=False` in the call to the tokenizer, but this will lead to un-expected behaviour and is discouraged.
 * The speaking rate can be varied by setting the attribute `model.speaking_rate` to a chosen value. Likewise, the randomness of the noise is controlled by `model.noise_scale`:

 ```python

@@ -54,7 +54,7 @@ found [here](https://github.com/google/trax/tree/master/trax/models/reformer).

 Axial Positional Encodings were first implemented in Google's [trax library](https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29)
 and developed by the authors of this model's paper. In models that are treating very long input sequences, the
-conventional position id encodings store an embedings vector of size \\(d\\) being the `config.hidden_size` for
+conventional position id encodings store an embeddings vector of size \\(d\\) being the `config.hidden_size` for
 every position \\(i, \ldots, n_s\\), with \\(n_s\\) being `config.max_embedding_size`. This means that having
 a sequence length of \\(n_s = 2^{19} \approx 0.5M\\) and a `config.hidden_size` of \\(d = 2^{10} \approx 1000\\)
 would result in a position encoding matrix:

@@ -89,7 +89,7 @@ In a traditional auto-regressive Transformer, attention is written as

 $$O = \hbox{softmax}(QK^{T} / \sqrt{d}) V$$

-with \\(Q\\), \\(K\\) and \\(V\\) are matrices of shape `seq_len x hidden_size` named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product \\(QK^{T}\\) then has shape `seq_len x seq_len` and we can take the maxtrix product with \\(V\\) to get the output \\(O\\) of the same shape as the others.
+with \\(Q\\), \\(K\\) and \\(V\\) are matrices of shape `seq_len x hidden_size` named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product \\(QK^{T}\\) then has shape `seq_len x seq_len` and we can take the matrix product with \\(V\\) to get the output \\(O\\) of the same shape as the others.

 Replacing the softmax by its value gives:

@@ -109,7 +109,7 @@ with \\(u\\) and \\(w\\) learnable parameters called in the code `time_first` an

 $$N_{i} = e^{u + K_{i}} V_{i} + \hat{N}_{i} \hbox{ where } \hat{N}_{i} = e^{K_{i-1}} V_{i-1} + e^{w + K_{i-2}} V_{i-2} \cdots + e^{(i-2)w + K_{1}} V_{1}$$

-so \\(\hat{N}_{i}\\) (called `numerator_state` in the code) satistfies
+so \\(\hat{N}_{i}\\) (called `numerator_state` in the code) satisfies

 $$\hat{N}_{0} = 0 \hbox{ and } \hat{N}_{j+1} = e^{K_{j}} V_{j} + e^{w} \hat{N}_{j}$$

@@ -117,7 +117,7 @@ and

 $$D_{i} = e^{u + K_{i}} + \hat{D}_{i} \hbox{ where } \hat{D}_{i} = e^{K_{i-1}} + e^{w + K_{i-2}} \cdots + e^{(i-2)w + K_{1}}$$

-so \\(\hat{D}_{i}\\) (called `denominator_state` in the code) satistfies
+so \\(\hat{D}_{i}\\) (called `denominator_state` in the code) satisfies

 $$\hat{D}_{0} = 0 \hbox{ and } \hat{D}_{j+1} = e^{K_{j}} + e^{w} \hat{D}_{j}$$

@@ -47,7 +47,7 @@ found [here](https://github.com/google-research/t5x).

 - UMT5 was only pre-trained on [mC4](https://huggingface.co/datasets/mc4) excluding any supervised training.
 Therefore, this model has to be fine-tuned before it is usable on a downstream task, unlike the original T5 model.
-- Since umT5 was pre-trained in an unsupervise manner, there's no real advantage to using a task prefix during single-task
+- Since umT5 was pre-trained in an unsupervised manner, there's no real advantage to using a task prefix during single-task
 fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.

 ## Differences with mT5?

@@ -31,7 +31,7 @@ this paper, we aim to improve the existing SSL framework for speaker representat
 introduced for enhancing the unsupervised speaker information extraction. First, we apply the multi-task learning to
 the current SSL framework, where we integrate the utterance-wise contrastive loss with the SSL objective function.
 Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, where
-additional overlapped utterances are created unsupervisely and incorporate during training. We integrate the proposed
+additional overlapped utterances are created unsupervisedly and incorporate during training. We integrate the proposed
 methods into the HuBERT framework. Experiment results on SUPERB benchmark show that the proposed system achieves
 state-of-the-art performance in universal representation learning, especially for speaker identification oriented
 tasks. An ablation study is performed verifying the efficacy of each proposed method. Finally, we scale up training

@@ -39,7 +39,7 @@ Tips:

 - VAN does not have an embedding layer, thus the `hidden_states` will have a length equal to the number of stages.

-The figure below illustrates the architecture of a Visual Aattention Layer. Taken from the [original paper](https://arxiv.org/abs/2202.09741).
+The figure below illustrates the architecture of a Visual Attention Layer. Taken from the [original paper](https://arxiv.org/abs/2202.09741).

 <img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/van_architecture.png"/>

@@ -60,7 +60,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

 🚀 Deploy

-- A blog post on how to deploy Wav2Vec2 for [Automatic Speech Recogntion with Hugging Face's Transformers & Amazon SageMaker](https://www.philschmid.de/automatic-speech-recognition-sagemaker).
+- A blog post on how to deploy Wav2Vec2 for [Automatic Speech Recognition with Hugging Face's Transformers & Amazon SageMaker](https://www.philschmid.de/automatic-speech-recognition-sagemaker).

 ## Wav2Vec2Config

@@ -31,7 +31,7 @@ challenging. In this paper, we propose a new pre-trained model, WavLM, to solve
 WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity
 preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on
 recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, where
-additional overlapped utterances are created unsupervisely and incorporated during model training. Lastly, we scale up
+additional overlapped utterances are created unsupervisedly and incorporated during model training. Lastly, we scale up
 the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB
 benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.*
