Use HF papers (#38184)

* Use hf papers

* Hugging Face papers

* doi to hf papers

* style
This commit is contained in:
Quentin Gallouédec
2025-06-13 13:07:09 +02:00
committed by GitHub
parent 1031ed5166
commit de24fb63ed
811 changed files with 2622 additions and 2617 deletions

View File

@@ -26,7 +26,7 @@ way which enables simple and efficient model parallelism.
## Masked language modeling
In the following, we demonstrate how to train a bi-directional transformer model
using masked language modeling objective as introduced in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
using masked language modeling objective as introduced in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://huggingface.co/papers/1810.04805).
More specifically, we demonstrate how JAX/Flax can be leveraged
to pre-train [**`FacebookAI/roberta-base`**](https://huggingface.co/FacebookAI/roberta-base)
in Norwegian on a single TPUv3-8 pod.
@@ -229,7 +229,7 @@ look at [this](https://colab.research.google.com/github/huggingface/notebooks/bl
## T5-like span-masked language modeling
In the following, we demonstrate how to train a T5 model using the span-masked language model
objective as proposed in the [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683).
objective as proposed in the [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://huggingface.co/papers/1910.10683).
More specifically, we demonstrate how JAX/Flax can be leveraged
to pre-train [**`google/t5-v1_1-base`**](https://huggingface.co/google/t5-v1_1-base)
in Norwegian on a single TPUv3-8 pod.
@@ -341,7 +341,7 @@ Training statistics can be accessed on directly on the 🤗 [hub](https://huggin
## BART: Denoising language modeling
In the following, we demonstrate how to train a BART model
using denoising language modeling objective as introduced in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461).
using denoising language modeling objective as introduced in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://huggingface.co/papers/1910.13461).
More specifically, we demonstrate how JAX/Flax can be leveraged
to pre-train [**`bart-base`**](https://huggingface.co/facebook/bart-base)
in Norwegian on a single TPUv3-8 pod.

View File

@@ -265,7 +265,7 @@ class FlaxDataCollatorForBartDenoisingLM:
Data collator used for BART denoising language modeling. The code is largely copied from
`<https://github.com/morganmcg1/rotobart/blob/main/data_collator.py#L223>`__.
For more information on how BART denoising language modeling works, one can take a look
at the `official paper <https://arxiv.org/pdf/1910.13461.pdf>`__
at the `official paper <https://huggingface.co/papers/1910.13461>`__
or the `official code for preprocessing <https://github.com/facebookresearch/fairseq/blob/main/fairseq/data/denoising_dataset.py>`__ .
Args:
tokenizer (:class:`~transformers.PreTrainedTokenizer` or :class:`~transformers.PreTrainedTokenizerFast`):

View File

@@ -309,7 +309,7 @@ class FlaxDataCollatorForT5MLM:
Data collator used for T5 span-masked language modeling.
It is made sure that after masking the inputs are of length `data_args.max_seq_length` and targets are also of fixed length.
For more information on how T5 span-masked language modeling works, one can take a look
at the `official paper <https://arxiv.org/pdf/1910.10683.pdf>`__
at the `official paper <https://huggingface.co/papers/1910.10683>`__
or the `official code for preprocessing <https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py>`__ .
Args:

View File

@@ -6,12 +6,12 @@ models which were pre-trained on the same training data (BooksCorpus and English
training, but with different relative position embeddings.
* `zhiheng-huang/bert-base-uncased-embedding-relative-key`, trained from scratch with relative embedding proposed by
Shaw et al., [Self-Attention with Relative Position Representations](https://arxiv.org/abs/1803.02155)
Shaw et al., [Self-Attention with Relative Position Representations](https://huggingface.co/papers/1803.02155)
* `zhiheng-huang/bert-base-uncased-embedding-relative-key-query`, trained from scratch with relative embedding method 4
in Huang et al. [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
in Huang et al. [Improve Transformer Models with Better Relative Position Embeddings](https://huggingface.co/papers/2009.13658)
* `zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query`, fine-tuned from model
`google-bert/bert-large-uncased-whole-word-masking` with 3 additional epochs with relative embedding method 4 in Huang et al.
[Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658)
[Improve Transformer Models with Better Relative Position Embeddings](https://huggingface.co/papers/2009.13658)
##### Base models fine-tuning

View File

@@ -37,7 +37,7 @@ class MyNewModelConfig(PretrainedConfig):
`num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
by meanpooling all the original heads within that group. For more details, check out [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to
`num_attention_heads`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.

View File

@@ -35,7 +35,7 @@ class NewModelConfig(PretrainedConfig):
`num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
by meanpooling all the original heads within that group. For more details, check out [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to
`num_attention_heads`.
head_dim (`int`, *optional*, defaults to 256):
The attention head dimension.

View File

@@ -53,7 +53,7 @@ class TestAttention(nn.Module):
Adapted from transformers.models.mistral.modeling_mistral.MistralAttention:
The input dimension here is attention_hidden_size = 2 * hidden_size, and head_dim = attention_hidden_size // num_heads.
The extra factor of 2 comes from the input being the concatenation of original_hidden_states with the output of the previous (mamba) layer
(see fig. 2 in https://arxiv.org/pdf/2405.16712).
(see fig. 2 in https://huggingface.co/papers/2405.16712).
Additionally, replaced
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) with
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim/2)

View File

@@ -388,7 +388,7 @@ DUMMY_INPUTS_DOCSTRING = r"""
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
and modify to your needs. See diagram 1 in [the paper](https://huggingface.co/papers/1910.13461) for more
information on the default strategy.
- 1 indicates the head is **not masked**,

View File

@@ -845,7 +845,7 @@ class DummyBertModel(DummyBertPreTrainedModel):
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
cross-attention is added between the self-attention layers, following the architecture described in [Attention is
all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
all you need](https://huggingface.co/papers/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
To behave as an decoder the model needs to be initialized with the `is_decoder` argument of the configuration set

View File

@@ -388,7 +388,7 @@ MULTIMODAL1_TEXT_INPUTS_DOCSTRING = r"""
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
and modify to your needs. See diagram 1 in [the paper](https://huggingface.co/papers/1910.13461) for more
information on the default strategy.
- 1 indicates the head is **not masked**,

View File

@@ -386,7 +386,7 @@ MY_NEW_MODEL2_INPUTS_DOCSTRING = r"""
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
and modify to your needs. See diagram 1 in [the paper](https://huggingface.co/papers/1910.13461) for more
information on the default strategy.
- 1 indicates the head is **not masked**,

View File

@@ -848,7 +848,7 @@ class RobertaModel(RobertaPreTrainedModel):
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
cross-attention is added between the self-attention layers, following the architecture described in [Attention is
all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
all you need](https://huggingface.co/papers/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
To behave as an decoder the model needs to be initialized with the `is_decoder` argument of the configuration set

View File

@@ -389,7 +389,7 @@ SUPER_INPUTS_DOCSTRING = r"""
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
and modify to your needs. See diagram 1 in [the paper](https://huggingface.co/papers/1910.13461) for more
information on the default strategy.
- 1 indicates the head is **not masked**,

View File

@@ -25,12 +25,12 @@ NOTE: If you encounter problems/have suggestions for improvement, open an issue
## SimMIM
The `run_mim.py` script can be used to pre-train any Transformer-based vision model in the library (concretely, any model supported by the `AutoModelForMaskedImageModeling` API) for masked image modeling as proposed in [SimMIM: A Simple Framework for Masked Image Modeling](https://arxiv.org/abs/2111.09886) using PyTorch.
The `run_mim.py` script can be used to pre-train any Transformer-based vision model in the library (concretely, any model supported by the `AutoModelForMaskedImageModeling` API) for masked image modeling as proposed in [SimMIM: A Simple Framework for Masked Image Modeling](https://huggingface.co/papers/2111.09886) using PyTorch.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/simmim_architecture.jpg"
alt="drawing" width="300"/>
<small> SimMIM framework. Taken from the <a href="https://arxiv.org/abs/2111.09886">original paper</a>. </small>
<small> SimMIM framework. Taken from the <a href="https://huggingface.co/papers/2111.09886">original paper</a>. </small>
The goal for the model is to predict raw pixel values for the masked patches, using just a linear layer as prediction head. The model is trained using a simple L1 loss.
@@ -145,7 +145,7 @@ python run_mim.py \
## MAE
The `run_mae.py` script can be used to pre-train a Vision Transformer as a masked autoencoder (MAE), as proposed in [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377). The script can be used to train a `ViTMAEForPreTraining` model in the Transformers library, using PyTorch. After self-supervised pre-training, one can load the weights of the encoder directly into a `ViTForImageClassification`. The MAE method allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data.
The `run_mae.py` script can be used to pre-train a Vision Transformer as a masked autoencoder (MAE), as proposed in [Masked Autoencoders Are Scalable Vision Learners](https://huggingface.co/papers/2111.06377). The script can be used to train a `ViTMAEForPreTraining` model in the Transformers library, using PyTorch. After self-supervised pre-training, one can load the weights of the encoder directly into a `ViTForImageClassification`. The MAE method allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data.
The goal for the model is to predict raw pixel values for the masked patches. As the model internally masks patches and learns to reconstruct them, there's no need for any labels. The model uses the mean squared error (MSE) between the reconstructed and original images in the pixel space.
@@ -182,14 +182,14 @@ python run_mae.py \
Here we set:
- `mask_ratio` to 0.75 (to mask 75% of the patches for each image)
- `norm_pix_loss` to use normalized pixel values as target (the authors reported better representations with this enabled)
- `base_learning_rate` to 1.5e-4. Note that the effective learning rate is computed by the [linear schedule](https://arxiv.org/abs/1706.02677): `lr` = `blr` * total training batch size / 256. The total training batch size is computed as `training_args.train_batch_size` * `training_args.gradient_accumulation_steps` * `training_args.world_size`.
- `base_learning_rate` to 1.5e-4. Note that the effective learning rate is computed by the [linear schedule](https://huggingface.co/papers/1706.02677): `lr` = `blr` * total training batch size / 256. The total training batch size is computed as `training_args.train_batch_size` * `training_args.gradient_accumulation_steps` * `training_args.world_size`.
This replicates the same hyperparameters as used in the original implementation, as shown in the table below.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/mae_pretraining_setting.png"
alt="drawing" width="300"/>
<small> Original hyperparameters. Taken from the <a href="https://arxiv.org/abs/2111.06377">original paper</a>. </small>
<small> Original hyperparameters. Taken from the <a href="https://huggingface.co/papers/2111.06377">original paper</a>. </small>
Alternatively, one can decide to further pre-train an already pre-trained (or fine-tuned) checkpoint from the [hub](https://huggingface.co/). This can be done by setting the `model_name_or_path` argument to "facebook/vit-mae-base" for example.

View File

@@ -37,7 +37,7 @@ from transformers.utils import check_min_version, send_example_telemetry
from transformers.utils.versions import require_version
""" Pre-training a 🤗 ViT model as an MAE (masked autoencoder), as proposed in https://arxiv.org/abs/2111.06377."""
""" Pre-training a 🤗 ViT model as an MAE (masked autoencoder), as proposed in https://huggingface.co/papers/2111.06377."""
logger = logging.getLogger(__name__)

View File

@@ -75,7 +75,7 @@ python run_clm_no_trainer.py \
### GPT-2/GPT and causal language modeling with fill-in-the middle objective
The following example fine-tunes GPT-2 on WikiText-2 but using the Fill-in-middle training objective. FIM objective was proposed in [Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255). They showed that autoregressive language models can learn to infill text after applying a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end.
The following example fine-tunes GPT-2 on WikiText-2 but using the Fill-in-middle training objective. FIM objective was proposed in [Efficient Training of Language Models to Fill in the Middle](https://huggingface.co/papers/2207.14255). They showed that autoregressive language models can learn to infill text after applying a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end.
We're using the raw WikiText-2 (no tokens were replaced before the tokenization). The loss here is that of causal language modeling.

View File

@@ -21,7 +21,7 @@ limitations under the License.
The script [`run_speech_wav2vec2_pretraining_no_trainer.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py) can be used to pre-train a [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html?highlight=wav2vec2) model from scratch.
In the script [`run_speech_wav2vec2_pretraining_no_trainer`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py), a Wav2Vec2 model is pre-trained on audio data alone using [Wav2Vec2's contrastive loss objective](https://arxiv.org/abs/2006.11477).
In the script [`run_speech_wav2vec2_pretraining_no_trainer`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py), a Wav2Vec2 model is pre-trained on audio data alone using [Wav2Vec2's contrastive loss objective](https://huggingface.co/papers/2006.11477).
The following examples show how to fine-tune a `"base"`-sized Wav2Vec2 model as well as a `"large"`-sized Wav2Vec2 model using [`accelerate`](https://github.com/huggingface/accelerate).

View File

@@ -314,7 +314,7 @@ class DataCollatorForWav2Vec2Pretraining:
mask_time_prob (:obj:`float`, `optional`, defaults to :obj:`0.65`):
Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked for the contrastive task.
Note that overlap between masked sequences may decrease the actual percentage of masked vectors.
The default value is taken from the original wav2vec 2.0 article (https://arxiv.org/abs/2006.11477),
The default value is taken from the original wav2vec 2.0 article (https://huggingface.co/papers/2006.11477),
and results in about 49 percent of each sequence being masked on average.
mask_time_length (:obj:`int`, `optional`, defaults to :obj:`10`):
Length of each vector mask span to mask along the time axis in the contrastive task. The default value

View File

@@ -493,7 +493,7 @@ Note that we have added a randomly initialized _adapter layer_ to `wav2vec2-base
`encoder_add_adapter=True`. This adapter sub-samples the output sequence of
`wav2vec2-base` along the time dimension. By default, a single
output vector of `wav2vec2-base` has a receptive field of *ca.* 25ms (*cf.*
Section *4.2* of the [official Wav2Vec2 paper](https://arxiv.org/pdf/2006.11477.pdf)), which represents a little less a single character. On the other hand, BART
Section *4.2* of the [official Wav2Vec2 paper](https://huggingface.co/papers/2006.11477)), which represents a little less a single character. On the other hand, BART
makes use of a sentence-piece tokenizer as an input processor, so that a single
hidden vector of `bart-base` represents *ca.* 4 characters. To better align the
receptive field of the *Wav2Vec2* output vectors with *BART*'s hidden-states in the cross-attention

View File

@@ -8,7 +8,7 @@ This example will demonstrate pre-training language models at the 100M-1B parame
We've tried to ensure that all the practices we show you here are scalable, though - with relatively few changes, the code could be scaled up to much larger models.
Google's gargantuan [PaLM model](https://arxiv.org/abs/2204.02311), with
Google's gargantuan [PaLM model](https://huggingface.co/papers/2204.02311), with
over 500B parameters, is a good example of how far you can go with pure TPU training, though gathering the dataset and the budget to train at that scale is not an easy task!
### Table of contents