[Docs] Fix broken links and syntax issues (#28918)

* Fix model documentation links in attention.md
* Fix external link syntax
* Fix target anchor names of section links
* Fix copyright statement comments
* Fix documentation headings
2024-02-08 23:13:35 +01:00
parent d628664688
commit 2749e479f3
36 changed files with 59 additions and 62 deletions
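
Note on the anchor fixes: most corrected targets in this diff (for example `#using--peft`, `#using-torchcompile`, and `#connectionist-temporal-classification-ctc`) are consistent with a GitHub-style heading slug. The sketch below is only an approximation under that assumption. The helper name `heading_to_anchor` is hypothetical, the heading texts are inferred from the anchors rather than shown in this diff, and the documentation builder's actual rule may differ.

```python
import re


def heading_to_anchor(heading: str) -> str:
    """Approximate a GitHub-style heading slug (assumed rule, not the doc builder's code)."""
    slug = heading.strip().lower()
    # Drop everything except word characters, whitespace, and hyphens.
    # This removes punctuation such as "." and "()" as well as emoji.
    slug = re.sub(r"[^\w\s-]", "", slug)
    # Every space becomes one hyphen, so an emoji removed from between
    # two spaces leaves a double hyphen behind.
    return slug.replace(" ", "-")


# Heading texts below are assumptions inferred from the anchors in the diff.
print(heading_to_anchor("Using 🤗 PEFT"))        # using--peft
print(heading_to_anchor("Using torch.compile"))  # using-torchcompile
print(heading_to_anchor("Connectionist temporal classification (CTC)"))
# connectionist-temporal-classification-ctc
```

Under this rule the double hyphen in `#using--accelerate` and `#adding-tensorflow-weights-to--hub` is expected rather than a typo: the emoji is dropped but the spaces on both sides of it are each converted to a hyphen.
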
--- a/docs/source/en/add_new_model.md
+++ b/docs/source/en/add_new_model.md
@@ -682,7 +682,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
 **7. Implement the forward pass**

 Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
-sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#34-run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward
+sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#3-4-run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward
 pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers
 implementation instead of the original one. It should look as follows:

--- a/docs/source/en/add_tensorflow_model.md
+++ b/docs/source/en/add_tensorflow_model.md
@@ -83,7 +83,7 @@ don't have your eyes set on a specific architecture, asking the 🤗 Transformer
 maximize your impact - we will guide you towards the most prominent architectures that are missing on the TensorFlow
 side. If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in
 🤗 Transformers but is lacking weights, feel free to jump straight into the
-[weight conversion section](#adding-tensorflow-weights-to-hub)
+[weight conversion section](#adding-tensorflow-weights-to--hub)
 of this page.

 For simplicity, the remainder of this guide assumes you've decided to contribute with the TensorFlow version of
--- a/docs/source/en/attention.md
+++ b/docs/source/en/attention.md
@@ -22,7 +22,7 @@ use a sparse version of the attention matrix to speed up training.

 ## LSH attention

-[Reformer](#reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
+[Reformer](model_doc/reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
 dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
 the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
 modified to mask the current token (except at the first position), because it will give a query and a key equal (so
@@ -31,7 +31,7 @@ very similar to each other). Since the hash can be a bit random, several hash fu

 ## Local attention

-[Longformer](#longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
+[Longformer](model_doc/longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
 left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
 window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
 representation of the whole sentence.
@@ -51,7 +51,7 @@ length.

 ### Axial positional encodings

-[Reformer](#reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
+[Reformer](model_doc/reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
 E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the
 hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
 that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
--- a/docs/source/en/glossary.md
+++ b/docs/source/en/glossary.md
@@ -187,7 +187,7 @@ The model head refers to the last layer of a neural network that accepts the raw

  * [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
  * [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
-  * [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-(CTC)) on top of the base [`Wav2Vec2Model`].
+  * [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].

 ## I

@@ -422,7 +422,7 @@ Models that generate a new sequence from an input, like translation models, or s

 ### Sharded DDP

-Another name for the foundational [ZeRO](#zero-redundancy-optimizer--zero-) concept as used by various other implementations of ZeRO.
+Another name for the foundational [ZeRO](#zero-redundancy-optimizer-zero) concept as used by various other implementations of ZeRO.

 ### stride

--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -1,4 +1,4 @@
-        <!--Copyright 2020 The HuggingFace Team. All rights reserved.
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.

 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
--- a/docs/source/en/model_doc/mgp-str.md
+++ b/docs/source/en/model_doc/mgp-str.md
@@ -29,7 +29,7 @@ alt="drawing" width="600"/>

 <small> MGP-STR architecture. Taken from the <a href="https://arxiv.org/abs/2209.03592">original paper</a>. </small>

-MGP-STR is trained on two synthetic datasets [MJSynth]((http://www.robots.ox.ac.uk/~vgg/data/text/)) (MJ) and SynthText(http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST) without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE).
+MGP-STR is trained on two synthetic datasets [MJSynth]((http://www.robots.ox.ac.uk/~vgg/data/text/)) (MJ) and [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST) without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE).
 This model was contributed by [yuekun](https://huggingface.co/yuekun). The original code can be found [here](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR).

 ## Inference example
--- a/docs/source/en/model_doc/pegasus_x.md
+++ b/docs/source/en/model_doc/pegasus_x.md
@@ -26,7 +26,7 @@ The abstract from the paper is the following:

 *While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.*

-This model was contributed by [zphang](<https://huggingface.co/zphang). The original code can be found [here](https://github.com/google-research/pegasus).
+This model was contributed by [zphang](https://huggingface.co/zphang). The original code can be found [here](https://github.com/google-research/pegasus).

 ## Documentation resources

--- a/docs/source/en/model_doc/pvt.md
+++ b/docs/source/en/model_doc/pvt.md
@@ -38,7 +38,7 @@ object detection, instance and semantic segmentation. For example, with a compar
 achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope 
 that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.*

-This model was contributed by [Xrenya](<https://huggingface.co/Xrenya). The original code can be found [here](https://github.com/whai362/PVT).
+This model was contributed by [Xrenya](https://huggingface.co/Xrenya). The original code can be found [here](https://github.com/whai362/PVT).


 - PVTv1 on ImageNet-1K
--- a/docs/source/en/model_doc/t5.md
+++ b/docs/source/en/model_doc/t5.md
@@ -60,7 +60,7 @@ for summarization: *summarize: ...*.

 - T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.

- See the [training](#training), [inference](#inference) and [scripts](#scripts) sections below for all details regarding usage.
+- See the [training](#training), [inference](#inference) and [resources](#resources) sections below for all details regarding usage.

 T5 comes in different sizes:

--- a/docs/source/en/perf_train_gpu_one.md
+++ b/docs/source/en/perf_train_gpu_one.md
@@ -51,7 +51,7 @@ The methods and tools covered in this guide can be classified based on the effec
 | [Data preloading](#data-preloading)                        | Yes                     | No                           |
 | [DeepSpeed Zero](#deepspeed-zero)                          | No                      | Yes                          |
 | [torch.compile](#using-torchcompile)                       | Yes                     | No                           |
-| [Parameter-Efficient Fine Tuning (PEFT)](#peft)            | No                      | Yes                          |
+| [Parameter-Efficient Fine Tuning (PEFT)](#using--peft)            | No                      | Yes                          |
 
 <Tip>

@@ -62,12 +62,12 @@ large model and a small batch size, the memory use will be larger.

 You can combine the above methods to get a cumulative effect. These techniques are available to you whether you are 
 training your model with [`Trainer`] or writing a pure PyTorch loop, in which case you can [configure these optimizations 
-with 🤗 Accelerate](#using-accelerate).
+with 🤗 Accelerate](#using--accelerate).

 If these methods do not result in sufficient gains, you can explore the following options: 
 * [Look into building your own custom Docker container with efficient softare prebuilds](#efficient-software-prebuilds)
 * [Consider a model that uses Mixture of Experts (MoE)](#mixture-of-experts)
-* [Convert your model to BetterTransformer to leverage PyTorch native attention](#using-pytorch-native-attention)
+* [Convert your model to BetterTransformer to leverage PyTorch native attention](#using-pytorch-native-attention-and-flash-attention)

 Finally, if all of the above is still not enough, even after switching to a server-grade GPU like A100, consider moving 
 to a multi-GPU setup. All these approaches are still valid in a multi-GPU setup, plus you can leverage additional parallelism 
@@ -110,7 +110,7 @@ training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumu
 In the above example, your effective batch size becomes 4. 

 Alternatively, use 🤗 Accelerate to gain full control over the training loop. Find the 🤗 Accelerate example 
-[further down in this guide](#using-accelerate).
+[further down in this guide](#using--accelerate).

 While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can 
 result in a more pronounced training slowdown. Consider the following example. Let's say, the `per_device_train_batch_size=4` 
@@ -143,7 +143,7 @@ training_args = TrainingArguments(
 )
 ```

-Alternatively, use 🤗 Accelerate - find the 🤗 Accelerate example [further in this guide](#using-accelerate). 
+Alternatively, use 🤗 Accelerate - find the 🤗 Accelerate example [further in this guide](#using--accelerate). 

 <Tip>

@@ -179,7 +179,7 @@ To enable mixed precision training, set the `fp16` flag to `True`:
 training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
 ```

-If you prefer to use 🤗 Accelerate, find the 🤗 Accelerate example [further in this guide](#using-accelerate). 
+If you prefer to use 🤗 Accelerate, find the 🤗 Accelerate example [further in this guide](#using--accelerate). 

 ### BF16

--- a/docs/source/en/quantization.md
+++ b/docs/source/en/quantization.md
@@ -214,7 +214,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="aut

 <Tip warning={true}>

-Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m]() model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists.
+Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists.

 </Tip>

--- a/docs/source/en/tasks/idefics.md
+++ b/docs/source/en/tasks/idefics.md
@@ -36,7 +36,7 @@ being a large model means it requires significant computational resources and in
 this approach suits your use case better than fine-tuning specialized models for each individual task. 

 In this guide, you'll learn how to: 
- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#loading-the-quantized-version-of-the-model)
+- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model)
 - Use IDEFICS for: 
  - [Image captioning](#image-captioning)
  - [Prompted image captioning](#prompted-image-captioning)
--- a/docs/source/en/tasks/prompting.md
+++ b/docs/source/en/tasks/prompting.md
@@ -35,7 +35,7 @@ practices that help to achieve optimal results more consistently.
 This guide covers the prompt engineering best practices to help you craft better LLM prompts and solve various NLP tasks. 
 You'll learn:

- [Basics of prompting](#basic-prompts)
+- [Basics of prompting](#basics-of-prompting)
 - [Best practices of LLM prompting](#best-practices-of-llm-prompting)
 - [Advanced prompting techniques: few-shot prompting and chain-of-thought](#advanced-prompting-techniques)
 - [When to fine-tune instead of prompting](#prompting-vs-fine-tuning)