diff --git a/docs/source/de/add_new_model.md b/docs/source/de/add_new_model.md
index ab169f25e3..3f3317dd8b 100644
--- a/docs/source/de/add_new_model.md
+++ b/docs/source/de/add_new_model.md
@@ -682,7 +682,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
 **7. Implementieren Sie den Vorwärtspass**
 Nachdem es Ihnen gelungen ist, die trainierten Gewichte korrekt in die 🤗 Transformers-Implementierung zu laden, sollten Sie nun dafür sorgen
-sicherstellen, dass der Forward Pass korrekt implementiert ist. In [Machen Sie sich mit dem ursprünglichen Repository vertraut](#34-run-a-pretrained-checkpoint-using-the-original-repository) haben Sie bereits ein Skript erstellt, das einen Forward Pass
+sicherstellen, dass der Forward Pass korrekt implementiert ist. In [Machen Sie sich mit dem ursprünglichen Repository vertraut](#3-4-führen-sie-einen-pre-training-checkpoint-mit-dem-original-repository-durch) haben Sie bereits ein Skript erstellt, das einen Forward Pass
 Durchlauf des Modells unter Verwendung des Original-Repositorys durchführt. Jetzt sollten Sie ein analoges Skript schreiben, das die 🤗 Transformers Implementierung anstelle der Originalimplementierung verwenden. Es sollte wie folgt aussehen:
diff --git a/docs/source/de/add_tensorflow_model.md b/docs/source/de/add_tensorflow_model.md
index e621100970..23702f2d30 100644
--- a/docs/source/de/add_tensorflow_model.md
+++ b/docs/source/de/add_tensorflow_model.md
@@ -83,7 +83,7 @@ Sie sich nicht auf eine bestimmte Architektur festgelegt haben, ist es eine gute
 Wir werden Sie zu den wichtigsten Architekturen führen, die auf der TensorFlow-Seite noch fehlen. Seite fehlen.
 Wenn das spezifische Modell, das Sie mit TensorFlow verwenden möchten, bereits eine Implementierung der TensorFlow-Architektur in 🤗 Transformers, aber es fehlen Gewichte, können Sie direkt in den
-Abschnitt [Gewichtskonvertierung](#adding-tensorflow-weights-to-hub)
+Abschnitt [Gewichtskonvertierung](#hinzufügen-von-tensorflow-gewichten-zum--hub)
 auf dieser Seite. Der Einfachheit halber wird im Rest dieser Anleitung davon ausgegangen, dass Sie sich entschieden haben, mit der TensorFlow-Version von
diff --git a/docs/source/en/add_new_model.md b/docs/source/en/add_new_model.md
index 87c67fcc96..70f7263e33 100644
--- a/docs/source/en/add_new_model.md
+++ b/docs/source/en/add_new_model.md
@@ -682,7 +682,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
 **7. Implement the forward pass**
 Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
-sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#34-run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward
+sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#3-4-run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward
 pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers implementation instead of the original one. It should look as follows:
diff --git a/docs/source/en/add_tensorflow_model.md b/docs/source/en/add_tensorflow_model.md
index 7ea81a9fe9..b2ff9bb899 100644
--- a/docs/source/en/add_tensorflow_model.md
+++ b/docs/source/en/add_tensorflow_model.md
@@ -83,7 +83,7 @@ don't have your eyes set on a specific architecture, asking the 🤗 Transformer
 maximize your impact - we will guide you towards the most prominent architectures that are missing on the TensorFlow side.
 If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in 🤗 Transformers but is lacking weights, feel free to jump straight into the
-[weight conversion section](#adding-tensorflow-weights-to-hub)
+[weight conversion section](#adding-tensorflow-weights-to--hub)
 of this page. For simplicity, the remainder of this guide assumes you've decided to contribute with the TensorFlow version of
diff --git a/docs/source/en/attention.md b/docs/source/en/attention.md
index 3a4f93b33f..02e4db58f5 100644
--- a/docs/source/en/attention.md
+++ b/docs/source/en/attention.md
@@ -22,7 +22,7 @@ use a sparse version of the attention matrix to speed up training.
 ## LSH attention
-[Reformer](#reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
+[Reformer](model_doc/reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
 dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is modified to mask the current token (except at the first position), because it will give a query and a key equal (so
@@ -31,7 +31,7 @@ very similar to each other). Since the hash can be a bit random, several hash fu
 ## Local attention
-[Longformer](#longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
+[Longformer](model_doc/longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
 left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a representation of the whole sentence.
@@ -51,7 +51,7 @@ length.
 ### Axial positional encodings
-[Reformer](#reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
+[Reformer](model_doc/reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
 E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
diff --git a/docs/source/en/glossary.md b/docs/source/en/glossary.md
index f4c4b1beac..96f5cbd0e6 100644
--- a/docs/source/en/glossary.md
+++ b/docs/source/en/glossary.md
@@ -187,7 +187,7 @@ The model head refers to the last layer of a neural network that accepts the raw
 * [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
 * [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
- * [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-(CTC)) on top of the base [`Wav2Vec2Model`].
+ * [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
 ## I
@@ -422,7 +422,7 @@ Models that generate a new sequence from an input, like translation models, or s
 ### Sharded DDP
-Another name for the foundational [ZeRO](#zero-redundancy-optimizer--zero-) concept as used by various other implementations of ZeRO.
+Another name for the foundational [ZeRO](#zero-redundancy-optimizer-zero) concept as used by various other implementations of ZeRO.
 ### stride
diff --git a/docs/source/en/index.md b/docs/source/en/index.md
index 0d24a355f7..40b2735f9c 100644
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -1,4 +1,4 @@
-