[docs] Performance docs tidy up, part 1 (#23963)

* first pass at the single gpu doc

* overview: improved clarity and navigation

* WIP

* updated intro and deepspeed sections

* improved torch.compile section

* more improvements

* minor improvements

* make style

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* feedback addressed

* mdx -> md

* link fix

* feedback addressed

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Maria Khalusova
2023-07-24 08:57:24 -04:00
committed by GitHub
parent 54ba8608d0
commit 75317aefb3
4 changed files with 607 additions and 595 deletions

@@ -20,77 +20,54 @@ rendered properly in your Markdown viewer.
# Performance and Scalability
Training larger and larger transformer models and deploying them to production comes with a range of challenges. During training your model can require more GPU memory than is available or be very slow to train and when you deploy it for inference it can be overwhelmed with the throughput that is required in the production environment. This documentation is designed to help you navigate these challenges and find the best setting for your use-case. We split the guides into training and inference as they come with different challenges and solutions. Then within each of them we have separate guides for different kinds of hardware settings (e.g. single vs. multi-GPU for training or CPU vs. GPU for inference).
Training large transformer models and deploying them to production present various challenges.
During training, the model may require more GPU memory than available or exhibit slow training speed. In the deployment
phase, the model can struggle to handle the required throughput in a production environment.
![perf_overview](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf_overview.png)
This documentation aims to assist you in overcoming these challenges and finding the optimal setting for your use-case.
The guides are divided into training and inference sections, as each comes with different challenges and solutions.
Within each section you'll find separate guides for different hardware configurations, such as single GPU vs. multi-GPU
for training or CPU vs. GPU for inference.
This document serves as an overview and entry point for the methods that could be useful for your scenario.
Use this document as your starting point to navigate further to the methods that match your scenario.
## Training
Training transformer models efficiently requires an accelerator such as a GPU or TPU. The most common case is where you only have a single GPU, but there is also a section about multi-GPU and CPU training (with more coming soon).
Training large transformer models efficiently requires an accelerator such as a GPU or TPU. The most common case is where
you have a single GPU. The methods that you can apply to improve training efficiency on a single GPU extend to other setups
such as multiple GPUs. However, there are also techniques that are specific to multi-GPU or CPU training. We cover them in
separate sections.
<Tip>
Note: Most of the strategies introduced in the single GPU sections (such as mixed precision training or gradient accumulation) are generic and apply to training models in general so make sure to have a look at it before diving into the following sections such as multi-GPU or CPU training.
</Tip>
### Single GPU
Training large models on a single GPU can be challenging but there are a number of tools and methods that make it feasible. In this section methods such as mixed precision training, gradient accumulation and checkpointing, efficient optimizers, as well as strategies to determine the best batch size are discussed.
[Go to single GPU training section](perf_train_gpu_one)
### Multi-GPU
In some cases training on a single GPU is still too slow or won't fit the large model. Moving to a multi-GPU setup is the logical step, but training on multiple GPUs at once comes with new decisions: does each GPU have a full copy of the model or is the model itself also distributed? In this section we look at data, tensor, and pipeline parallelism.
[Go to multi-GPU training section](perf_train_gpu_many)
### CPU
[Go to CPU training section](perf_train_cpu)
### TPU
[_Coming soon_](perf_train_tpu)
### Specialized Hardware
[_Coming soon_](perf_train_special)
* [Methods and tools for efficient training on a single GPU](perf_train_gpu_one): start here to learn common approaches that can help optimize GPU memory utilization, speed up the training, or both.
* [Multi-GPU training section](perf_train_gpu_many): explore this section to learn about further optimization methods that apply to multi-GPU settings, such as data, tensor, and pipeline parallelism.
* [CPU training section](perf_train_cpu): learn about mixed precision training on CPU.
* [Efficient Training on Multiple CPUs](perf_train_cpu_many): learn about distributed CPU training.
* [Training on TPU with TensorFlow](perf_train_tpu_tf): if you are new to TPUs, refer to this section for an opinionated introduction to training on TPUs and using XLA.
* [Custom hardware for training](perf_hardware): find tips and tricks when building your own deep learning rig.
* [Hyperparameter Search using Trainer API](hpo_train)
## Inference
Efficient inference with large models in a production environment can be as challenging as training them. In the following sections we go through the steps to run inference on CPU and single/multi-GPU setups.
Efficient inference with large models in a production environment can be as challenging as training them. In the following
sections we go through the steps to run inference on CPU and single/multi-GPU setups.
### CPU
* [Inference on a single CPU](perf_infer_cpu)
* [Inference on a single GPU](perf_infer_gpu_one)
* [Multi-GPU inference](perf_infer_gpu_many)
* [XLA Integration for TensorFlow Models](tf_xla)
[Go to CPU inference section](perf_infer_cpu)
### Single GPU
## Training and inference
[Go to single GPU inference section](perf_infer_gpu_one)
### Multi-GPU
[Go to multi-GPU inference section](perf_infer_gpu_many)
### Specialized Hardware
[_Coming soon_](perf_infer_special)
## Hardware
In the hardware section you can find tips and tricks when building your own deep learning rig.
[Go to hardware section](perf_hardware)
Here you'll find techniques, tips and tricks that apply whether you are training a model, or running inference with it.
* [Instantiating a big model](big_models)
* [Troubleshooting performance issues](debugging)
## Contribute
This document is far from being complete and a lot more needs to be added, so if you have additions or corrections to make please don't hesitate to open a PR or if you aren't sure start an Issue and we can discuss the details there.
This document is far from being complete and a lot more needs to be added, so if you have additions or corrections to
make please don't hesitate to open a PR or if you aren't sure start an Issue and we can discuss the details there.
When making contributions that A is better than B, please try to include a reproducible benchmark and/or a link to the source of that information (unless it comes directly from you).
When making contributions that A is better than B, please try to include a reproducible benchmark and/or a link to the
source of that information (unless it comes directly from you).