[docs] General doc fixes (#28087)

* doc fix friday
* deprecated objects
* update not_doctested
* update toctree
2023-12-18 10:44:09 -08:00
parent 08a6e7a702
commit a52e180a0f
7 changed files with 34 additions and 44 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -146,8 +146,6 @@
      title: Efficient training on CPU
    - local: perf_train_cpu_many
      title: Distributed CPU training
-    - local: perf_train_tpu
-      title: Training on TPUs
    - local: perf_train_tpu_tf
      title: Training on TPU with TensorFlow
    - local: perf_train_special
--- a/docs/source/en/main_classes/pipelines.md
+++ b/docs/source/en/main_classes/pipelines.md
@@ -400,12 +400,6 @@ Pipelines available for natural language processing tasks include the following.
    - __call__
    - all

-### NerPipeline
-
-[[autodoc]] NerPipeline
-
-See [`TokenClassificationPipeline`] for all details.
-
 ### QuestionAnsweringPipeline

 [[autodoc]] QuestionAnsweringPipeline
--- a/docs/source/en/perf_infer_gpu_one.md
+++ b/docs/source/en/perf_infer_gpu_one.md
@@ -56,9 +56,24 @@ FlashAttention-2 is currently supported for the following architectures:

 You can request to add FlashAttention-2 support for another model by opening a GitHub Issue or Pull Request.

-Before you begin, make sure you have FlashAttention-2 installed. For NVIDIA GPUs, the library is installable through pip: `pip install flash-attn --no-build-isolation`. We strongly suggest to refer to the [detailed installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features).
+Before you begin, make sure you have FlashAttention-2 installed.

-FlashAttention-2 is also supported on AMD GPUs, with the current support limited to **Instinct MI210 and Instinct MI250**. We strongly suggest to use the following [Dockerfile](https://github.com/huggingface/optimum-amd/tree/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile) to use FlashAttention-2 on AMD GPUs.
+<hfoptions id="install">
+<hfoption id="NVIDIA">
+
+```bash
+pip install flash-attn --no-build-isolation
+```
+
+We strongly suggest referring to the detailed [installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to learn more about supported hardware and data types!
+
+</hfoption>
+<hfoption id="AMD">
+
+FlashAttention-2 is also supported on AMD GPUs and current support is limited to **Instinct MI210** and **Instinct MI250**. We strongly suggest using this [Dockerfile](https://github.com/huggingface/optimum-amd/tree/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile) to use FlashAttention-2 on AMD GPUs.
+
+</hfoption>
+</hfoptions>

 To enable FlashAttention-2, pass the argument `attn_implementation="flash_attention_2"` to [`~AutoModelForCausalLM.from_pretrained`]:

@@ -80,7 +95,9 @@ model = AutoModelForCausalLM.from_pretrained(

 FlashAttention-2 can only be used when the model's dtype is `fp16` or `bf16`. Make sure to cast your model to the appropriate dtype and load them on a supported device before using FlashAttention-2.

-Note that `use_flash_attention_2=True` can also be used to enable Flash Attention 2, but is deprecated in favor of `attn_implementation="flash_attention_2"`.
+<br>
+
+You can also set `use_flash_attention_2=True` to enable FlashAttention-2 but it is deprecated in favor of `attn_implementation="flash_attention_2"`.
  
 </Tip>
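
Review note: the tip above keeps `use_flash_attention_2=True` documented as deprecated in favor of `attn_implementation="flash_attention_2"`. For callers who still pass the old flag, a minimal sketch of a keyword-argument migration could look like the following. The helper name `migrate_attention_kwargs` is hypothetical and not part of Transformers; it only rewrites a plain dictionary and assumes an explicit `attn_implementation` should take precedence.

```python
def migrate_attention_kwargs(kwargs):
    """Return a copy of `kwargs` with the deprecated flag replaced.

    Hypothetical helper (not a Transformers API): it drops
    `use_flash_attention_2` and, when that flag was truthy, sets
    `attn_implementation="flash_attention_2"` unless the caller
    already chose an implementation explicitly.
    """
    migrated = dict(kwargs)  # leave the caller's dictionary untouched
    use_flash = migrated.pop("use_flash_attention_2", None)
    if use_flash and "attn_implementation" not in migrated:
        migrated["attn_implementation"] = "flash_attention_2"
    return migrated


old_kwargs = {"use_flash_attention_2": True, "torch_dtype": "float16"}
new_kwargs = migrate_attention_kwargs(old_kwargs)
print(new_kwargs)
```

The resulting dictionary can then be unpacked into a `from_pretrained` call in place of the old keyword arguments.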

@@ -144,11 +161,11 @@ FlashAttention is more memory efficient, meaning you can train on much larger se
 <img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/llama-2-large-seqlen-padding.png">
 </div>

-## FlashAttention and memory-efficient attention through PyTorch's scaled_dot_product_attention 
+## PyTorch scaled dot product attention

-PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers, and is used by default for `torch>=2.1.1` when an implementation is available.
+PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers and is used by default for `torch>=2.1.1` when an implementation is available.

-For now, Transformers supports inference and training through SDPA for the following architectures:
+For now, Transformers supports SDPA inference and training for the following architectures:
 * [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel)
 * [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel)
 * [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
@@ -156,9 +173,13 @@ For now, Transformers supports inference and training through SDPA for the follo
 * [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel)
 * [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel)

-Note that FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type before using it.
+<Tip>

-By default, `torch.nn.functional.scaled_dot_product_attention` selects the most performant kernel available, but to check whether a backend is available in a given setting (hardware, problem size), you can use [`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel) as a context manager:
+FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first.
+
+</Tip>
+
+By default, SDPA selects the most performant kernel available but you can check whether a backend is available in a given setting (hardware, problem size) with [`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel) as a context manager:

 ```diff
 import torch
@@ -178,7 +199,7 @@ inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```

-If you see a bug with the traceback below, try using nightly version of PyTorch which may have broader coverage for FlashAttention:
+If you see a bug with the traceback below, try using the nightly version of PyTorch which may have broader coverage for FlashAttention:

 ```bash
 RuntimeError: No available kernel. Aborting execution.
@@ -191,11 +212,10 @@ pip3 install -U --pre torch torchvision torchaudio --index-url https://download.

 <Tip warning={true}>

-Part of BetterTransformer features are being upstreamed in Transformers, with native `torch.nn.scaled_dot_product_attention` default support. BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to support natively SDPA in Transformers.
+Some BetterTransformer features are being upstreamed to Transformers with default support for native `torch.nn.scaled_dot_product_attention`. BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to natively support SDPA in Transformers.

 </Tip>

-
 <Tip>

 Check out our benchmarks with BetterTransformer and scaled dot product attention in the [Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0](https://pytorch.org/blog/out-of-the-box-acceleration/) and learn more about the fastpath execution in the [BetterTransformer](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2) blog post.
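
Review note: the renamed "PyTorch scaled dot product attention" section refers to SDPA without stating what it computes. As a reference for readers of this diff, here is a minimal pure-Python sketch of the underlying operation, softmax(QK^T / sqrt(d)) V, without masking or dropout. It is not how PyTorch implements it (the fused FlashAttention and memory-efficient kernels mentioned in the section avoid materializing the full score matrix), and the function name `sdpa_reference` is hypothetical.

```python
import math


def sdpa_reference(query, key, value):
    """Naive attention over lists of row vectors (no mask, no dropout).

    query: list of vectors of size d
    key:   list of vectors of size d
    value: list of vectors, one per key
    """
    d = len(query[0])
    scale = 1.0 / math.sqrt(d)
    output = []
    for q in query:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in key]
        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        row = [
            sum(w * v[j] for w, v in zip(weights, value))
            for j in range(len(value[0]))
        ]
        output.append(row)
    return output


# A zero query scores every key equally, so the output is the mean of the values.
result = sdpa_reference([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(result)
```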
--- a/docs/source/en/perf_train_tpu.md
+++ b/docs/source/en/perf_train_tpu.md
@@ -1,24 +0,0 @@
-<!--Copyright 2022 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# Training on TPUs
-
-<Tip>
-
- Note: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) and [multi-GPU section](perf_train_gpu_many) are generic and apply to training models in general so make sure to have a look at it before diving into this section.
-
-</Tip>
-
-This document will be completed soon with information on how to train on TPUs.
--- a/docs/source/en/quantization.md
+++ b/docs/source/en/quantization.md
@@ -548,6 +548,7 @@ The benchmarks indicate AWQ quantization is the fastest for inference, text gene
 The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1` with and without fused modules.

 <figcaption class="text-center text-gray-500 text-lg">Unfused module</figcaption>
+
 |   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)   |
 |-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
 |            1 |               32 |              32 |            60.0984 |           38.4537 | 4.50 GB (5.68%) |
@@ -559,6 +560,7 @@ The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7
 |            1 |             2048 |            2048 |          2927.33   |           35.2676 | 5.73 GB (7.23%) |

 <figcaption class="text-center text-gray-500 text-lg">Fused module</figcaption>
+
 |   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)   |
 |-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
 |            1 |               32 |              32 |            81.4899 |           80.2569 | 4.00 GB (5.05%) |