[docs] Redesign (#31757)
* toctree * not-doctested.txt * collapse sections * feedback * update * rewrite get started sections * fixes * fix * loading models * fix * customize models * share * fix link * contribute part 1 * contribute pt 2 * fix toctree * tokenization pt 1 * Add new model (#32615) * v1 - working version * fix * fix * fix * fix * rename to correct name * fix title * fixup * rename files * fix * add copied from on tests * rename to `FalconMamba` everywhere and fix bugs * fix quantization + accelerate * fix copies * add `torch.compile` support * fix tests * fix tests and add slow tests * copies on config * merge the latest changes * fix tests * add few lines about instruct * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix * fix tests --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * "to be not" -> "not to be" (#32636) * "to be not" -> "not to be" * Update sam.md * Update trainer.py * Update modeling_utils.py * Update test_modeling_utils.py * Update test_modeling_utils.py * fix hfoption tag * tokenization pt. 2 * image processor * fix toctree * backbones * feature extractor * fix file name * processor * update not-doctested * update * make style * fix toctree * revision * make fixup * fix toctree * fix * make style * fix hfoption tag * pipeline * pipeline gradio * pipeline web server * add pipeline * fix toctree * not-doctested * prompting * llm optims * fix toctree * fixes * cache * text generation * fix * chat pipeline * chat stuff * xla * torch.compile * cpu inference * toctree * gpu inference * agents and tools * gguf/tiktoken * finetune * toctree * trainer * trainer pt 2 * optims * optimizers * accelerate * parallelism * fsdp * update * distributed cpu * hardware training * gpu training * gpu training 2 * peft * distrib debug * deepspeed 1 * deepspeed 2 * chat toctree * quant pt 1 * quant pt 2 * fix toctree * fix * fix * quant pt 3 * quant pt 4 * serialization * torchscript * scripts * tpu * review * model addition timeline * modular * more reviews * reviews * fix toctree * reviews reviews * continue reviews * more reviews * modular transformers * more review * zamba2 * fix * all frameworks * pytorch * supported model frameworks * flashattention * rm check_table * not-doctested.txt * rm check_support_list.py * feedback * updates/feedback * review * feedback * fix * update * feedback * updates * update --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
This commit is contained in:
@@ -1,4 +1,4 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
@@ -13,69 +13,57 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Efficient Training on Multiple CPUs
|
||||
# Distributed CPUs
|
||||
|
||||
When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling
|
||||
distributed CPU training efficiently on [bare metal](#usage-in-trainer) and [Kubernetes](#usage-with-kubernetes).
|
||||
CPUs are commonly available and can be a cost-effective training option when GPUs are unavailable. When training large models or if a single CPU is too slow, distributed training with CPUs can help speed up training.
|
||||
|
||||
## Intel® oneCCL Bindings for PyTorch
|
||||
This guide demonstrates how to perform distributed training with multiple CPUs using a [DistributedDataParallel (DDP)](./perf_train_gpu_many#distributeddataparallel) strategy on bare metal with [`Trainer`] and a Kubernetes cluster. All examples shown in this guide depend on the [Intel oneAPI HPC Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit.html).
|
||||
|
||||
[Intel® oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library) is a library for efficient distributed deep learning training implementing such collectives like allreduce, allgather, alltoall. For more information on oneCCL, please refer to the [oneCCL documentation](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html) and [oneCCL specification](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html).
|
||||
There are two toolkits you'll need from Intel oneAPI.
|
||||
|
||||
Module `oneccl_bindings_for_pytorch` (`torch_ccl` before version 1.12) implements PyTorch C10D ProcessGroup API and can be dynamically loaded as external ProcessGroup and only works on Linux platform now
|
||||
1. [oneCCL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/oneccl.html) includes efficient implementations of collectives commonly used in deep learning such as all-gather, all-reduce, and reduce-scatter. To install from a prebuilt wheel, make sure you always use the latest release. Refer to the table [here](https://github.com/intel/torch-ccl#install-prebuilt-wheel) to check if a version of oneCCL is supported for a Python and PyTorch version.
|
||||
|
||||
Check more detailed information for [oneccl_bind_pt](https://github.com/intel/torch-ccl).
|
||||
|
||||
### Intel® oneCCL Bindings for PyTorch installation
|
||||
|
||||
Wheel files are available for the following Python versions:
|
||||
|
||||
| Extension Version | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Python 3.11 |
|
||||
| :---------------: | :--------: | :--------: | :--------: | :---------: | :---------: |
|
||||
| 2.5.0 | | √ | √ | √ | √ |
|
||||
| 2.4.0 | | √ | √ | √ | √ |
|
||||
| 2.3.0 | | √ | √ | √ | √ |
|
||||
| 2.2.0 | | √ | √ | √ | √ |
|
||||
|
||||
Please run `pip list | grep torch` to get your `pytorch_version`.
|
||||
```bash
|
||||
pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu
|
||||
# installs oneCCL for PyTorch 2.4.0
|
||||
pip install oneccl_bind_pt==2.4.0 -f https://developer.intel.com/ipex-whl-stable-cpu
|
||||
```
|
||||
where `{pytorch_version}` should be your PyTorch version, for instance 2.4.0.
|
||||
Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl).
|
||||
Versions of oneCCL and PyTorch must match.
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the oneCCL [installation](https://github.com/intel/torch-ccl#installation) for more details.
|
||||
|
||||
## Intel® MPI library
|
||||
Use this standards-based MPI implementation to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. This component is part of the Intel® oneAPI HPC Toolkit.
|
||||
|
||||
oneccl_bindings_for_pytorch is installed along with the MPI tool set. Need to source the environment before using it.
|
||||
1. [MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html) is a message-passing interface for communications between hardware and networks. The oneCCL toolkit is installed along with MPI, but you need to source the environment as shown below before using it.
|
||||
|
||||
```bash
|
||||
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
|
||||
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
|
||||
```
|
||||
|
||||
#### Intel® Extension for PyTorch installation
|
||||
Lastly, install the [Intex Extension for PyTorch (IPEX)](https://intel.github.io/intel-extension-for-pytorch/index.html) which enables additional performance optimizations for Intel hardware such as weight sharing and better thread runtime control.
|
||||
|
||||
Intel Extension for PyTorch (IPEX) provides performance optimizations for CPU training with both Float32 and BFloat16 (refer to the [single CPU section](./perf_train_cpu) to learn more).
|
||||
```bash
|
||||
pip install intel_extension_for_pytorch==<version_name> -f https://developer.intel.com/ipex-whl-stable-cpu
|
||||
```
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the IPEX [installation](https://intel.github.io/intel-extension-for-pytorch/index.html#installation) for more details.
|
||||
|
||||
The following "Usage in Trainer" takes mpirun in Intel® MPI library as an example.
|
||||
## Trainer
|
||||
|
||||
[`Trainer`] supports distributed training with CPUs with the oneCCL backend. Add the `--ddp_backend ccl` parameter in the command arguments to enable it.
|
||||
|
||||
## Usage in Trainer
|
||||
To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--ddp_backend ccl`** in the command arguments.
|
||||
<hfoptions id="distrib-cpu">
|
||||
<hfoption id="single node">
|
||||
|
||||
Let's see an example with the [question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)
|
||||
The example below demonstrates the [run_qa.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) script. It enables training with two processes on one Xeon CPU, with one process running per socket.
|
||||
|
||||
> [!TIP]
|
||||
> Tune the variable `OMP_NUM_THREADS/CCL_WORKER_COUNT` for optimal performance.
|
||||
|
||||
The following command enables training with 2 processes on one Xeon node, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
|
||||
```shell script
|
||||
export CCL_WORKER_COUNT=1
|
||||
export MASTER_ADDR=127.0.0.1
|
||||
mpirun -n 2 -genv OMP_NUM_THREADS=23 \
|
||||
python3 examples/pytorch/question-answering/run_qa.py \
|
||||
```bash
|
||||
export CCL_WORKER_COUNT=1
|
||||
export MASTER_ADDR=127.0.0.1
|
||||
mpirun -n 2 -genv OMP_NUM_THREADS=23 \
|
||||
python3 run_qa.py \
|
||||
--model_name_or_path google-bert/bert-large-uncased \
|
||||
--dataset_name squad \
|
||||
--do_train \
|
||||
@@ -90,21 +78,31 @@ The following command enables training with 2 processes on one Xeon node, with o
|
||||
--ddp_backend ccl \
|
||||
--use_ipex
|
||||
```
|
||||
The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
|
||||
|
||||
In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument.
|
||||
```shell script
|
||||
cat hostfile
|
||||
xxx.xxx.xxx.xxx #node0 ip
|
||||
xxx.xxx.xxx.xxx #node1 ip
|
||||
</hfoption>
|
||||
<hfoption id="multiple nodes">
|
||||
|
||||
Scale the training script to four processes on two Xeon CPUs (`node0` and `node1`) by setting `-n 4` and `ppn 2`. The `ppn` parameter specifies the number of processes per node, with one process running per socket.
|
||||
|
||||
Assume `node0` is the main process and create a configuration file containing the IP addresses of each node (for example, hostfile) and pass the configuration file path as an argument.
|
||||
|
||||
```bash
|
||||
cat hostfile
|
||||
xxx.xxx.xxx.xxx #node0 ip
|
||||
xxx.xxx.xxx.xxx #node1 ip
|
||||
```
|
||||
Now, run the following command in node0 and **4DDP** will be enabled in node0 and node1 with BF16 auto mixed precision:
|
||||
```shell script
|
||||
export CCL_WORKER_COUNT=1
|
||||
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
|
||||
mpirun -f hostfile -n 4 -ppn 2 \
|
||||
|
||||
Run the script below on `node0` to enable DDP on `node0` and `node1` and train with bf16 auto mixed precision.
|
||||
|
||||
> [!TIP]
|
||||
> Tune the variable `OMP_NUM_THREADS/CCL_WORKER_COUNT` for optimal performance.
|
||||
|
||||
```bash
|
||||
export CCL_WORKER_COUNT=1
|
||||
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
|
||||
mpirun -f hostfile -n 4 -ppn 2 \
|
||||
-genv OMP_NUM_THREADS=23 \
|
||||
python3 examples/pytorch/question-answering/run_qa.py \
|
||||
python3 run_qa.py \
|
||||
--model_name_or_path google-bert/bert-large-uncased \
|
||||
--dataset_name squad \
|
||||
--do_train \
|
||||
@@ -121,25 +119,20 @@ Now, run the following command in node0 and **4DDP** will be enabled in node0 an
|
||||
--bf16
|
||||
```
|
||||
|
||||
## Usage with Kubernetes
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
The same distributed training job from the previous section can be deployed to a Kubernetes cluster using the
|
||||
[Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/user-guides/pytorch).
|
||||
## Kubernetes
|
||||
|
||||
### Setup
|
||||
Distributed training with CPUs can also be deployed to a Kubernetes cluster with [PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/). Before you get started, you should perform the following setup steps.
|
||||
|
||||
This example assumes that you have:
|
||||
* Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow)
|
||||
* [`kubectl`](https://kubernetes.io/docs/tasks/tools) installed and configured to access the Kubernetes cluster
|
||||
* A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes) that can be used
|
||||
to store datasets and model files. There are multiple options for setting up the PVC including using an NFS
|
||||
[storage class](https://kubernetes.io/docs/concepts/storage/storage-classes) or a cloud storage bucket.
|
||||
* A Docker container that includes your model training script and all the dependencies needed to run the script. For
|
||||
distributed CPU training jobs, this typically includes PyTorch, Transformers, Intel Extension for PyTorch, Intel
|
||||
oneCCL Bindings for PyTorch, and OpenSSH to communicate between the containers.
|
||||
1. Ensure you have access to a Kubernetes cluster with [Kubeflow](https://www.kubeflow.org/docs/started/installing-kubeflow/) installed.
|
||||
1. Install and configure [kubectl](https://kubernetes.io/docs/tasks/tools) to interact with the cluster.
|
||||
1. Set up a [PersistentVolumeClaim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) to store datasets and model files. There are multiple options to choose from, including a [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) or a cloud storage bucket.
|
||||
1. Set up a Docker container for the training script and all required dependencies such as PyTorch, Transformers, IPEX, oneCCL, and OpenSSH to facilitate communicattion between containers.
|
||||
|
||||
The example Dockerfile below uses a base image that supports distributed training with CPUs, and extracts Transformers to the `/workspace` directory to include the training scripts in the image. The image needs to be built and copied to the clusters nodes or pushed to a container registry prior to deployment.
|
||||
|
||||
The snippet below is an example of a Dockerfile that uses a base image that supports distributed CPU training and then
|
||||
extracts a Transformers release to the `/workspace` directory, so that the example scripts are included in the image:
|
||||
```dockerfile
|
||||
FROM intel/intel-optimized-pytorch:2.4.0-pip-multinode
|
||||
|
||||
@@ -157,26 +150,17 @@ RUN pip install --no-cache-dir \
|
||||
mkdir transformers && \
|
||||
curl -sSL --retry 5 https://github.com/huggingface/transformers/archive/refs/tags/v${HF_TRANSFORMERS_VER}.tar.gz | tar -C transformers --strip-components=1 -xzf -
|
||||
```
|
||||
The image needs to be built and copied to the cluster's nodes or pushed to a container registry prior to deploying the
|
||||
PyTorchJob to the cluster.
|
||||
|
||||
### PyTorchJob Specification File
|
||||
### PyTorchJob
|
||||
|
||||
The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch) is used to run the distributed
|
||||
training job on the cluster. The yaml file for the PyTorchJob defines parameters such as:
|
||||
* The name of the PyTorchJob
|
||||
* The number of replicas (workers)
|
||||
* The python script and it's parameters that will be used to run the training job
|
||||
* The types of resources (node selector, memory, and CPU) needed for each worker
|
||||
* The image/tag for the Docker container to use
|
||||
* Environment variables
|
||||
* A volume mount for the PVC
|
||||
[PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/) is an extension of the Kubernetes API for running PyTorch training jobs on Kubernetes. It includes a yaml file that defines the training jobs parameters such as the name of the PyTorchJob, number of workers, types of resources for each worker, and more.
|
||||
|
||||
The volume mount defines a path where the PVC will be mounted in the container for each worker pod. This location can be
|
||||
used for the dataset, checkpoint files, and the saved model after training completes.
|
||||
The volume mount parameter is a path to where the PVC is mounted in the container for each worker pod. The PVC is typically used to hold the dataset, checkpoint files, and the model after it has finished training.
|
||||
|
||||
The example yaml file below sets up four workers on the [run_qa.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) script. Adapt the yaml file based on your training script and number of nodes in your cluster.
|
||||
|
||||
The CPU resource limits and requests are defined in [CPU units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu). One CPU unit is equivalent to one physical CPU core or virtual core. The CPU units defined in the yaml file should be less than the amount of available CPU and memory capacity of a single machine in order to leave some resources for kubelet and the system. For a `Guaranteed` [quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod), set the same CPU and memory amounts for both the resource limits and requests.
|
||||
|
||||
The snippet below is an example of a yaml file for a PyTorchJob with 4 workers running the
|
||||
[question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering).
|
||||
```yaml
|
||||
apiVersion: "kubeflow.org/v1"
|
||||
kind: PyTorchJob
|
||||
@@ -255,35 +239,22 @@ spec:
|
||||
emptyDir:
|
||||
medium: Memory
|
||||
```
|
||||
To run this example, update the yaml based on your training script and the nodes in your cluster.
|
||||
|
||||
<Tip>
|
||||
|
||||
The CPU resource limits/requests in the yaml are defined in
|
||||
[cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)
|
||||
where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical
|
||||
host or a VM). The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of
|
||||
available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in
|
||||
order to leave some resources for the kubelet and OS. In order to get ["guaranteed"](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed)
|
||||
[quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod) for the worker pods,
|
||||
set the same CPU and memory amounts for both the resource limits and requests.
|
||||
|
||||
</Tip>
|
||||
|
||||
### Deploy
|
||||
|
||||
After the PyTorchJob spec has been updated with values appropriate for your cluster and training job, it can be deployed
|
||||
to the cluster using:
|
||||
After you've setup the PyTorchJob yaml file with the appropriate settings for your cluster and training job, deploy it to the cluster with the command below.
|
||||
|
||||
```bash
|
||||
export NAMESPACE=<specify your namespace>
|
||||
|
||||
kubectl create -f pytorchjob.yaml -n ${NAMESPACE}
|
||||
```
|
||||
|
||||
The `kubectl get pods -n ${NAMESPACE}` command can then be used to list the pods in your namespace. You should see
|
||||
the worker pods for the PyTorchJob that was just deployed. At first, they will probably have a status of "Pending" as
|
||||
the containers get pulled and created, then the status should change to "Running".
|
||||
```
|
||||
List the pods in the namespace with `kubectl get pods -n ${NAMESPACE}`. At first, the status may be "Pending" but it should change to "Running" once the containers are pulled and created.
|
||||
|
||||
```bash
|
||||
kubectl get pods -n ${NAMESPACE}
|
||||
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
...
|
||||
transformers-pytorchjob-worker-0 1/1 Running 0 7m37s
|
||||
@@ -293,16 +264,14 @@ transformers-pytorchjob-worker-3 1/1 Running
|
||||
...
|
||||
```
|
||||
|
||||
The logs for worker can be viewed using `kubectl logs <pod name> -n ${NAMESPACE}`. Add `-f` to stream the logs, for example:
|
||||
Inspect the logs for each worker with the following command. Add `-f` to stream the logs.
|
||||
|
||||
```bash
|
||||
kubectl logs transformers-pytorchjob-worker-0 -n ${NAMESPACE} -f
|
||||
```
|
||||
|
||||
After the training job completes, the trained model can be copied from the PVC or storage location. When you are done
|
||||
with the job, the PyTorchJob resource can be deleted from the cluster using `kubectl delete -f pytorchjob.yaml -n ${NAMESPACE}`.
|
||||
Once training is complete, the trained model can be copied from the PVC or storage location. Delete the PyTorchJob resource from the cluster with the command below.
|
||||
|
||||
## Summary
|
||||
|
||||
This guide covered running distributed PyTorch training jobs using multiple CPUs on bare metal and on a Kubernetes
|
||||
cluster. Both cases utilize Intel Extension for PyTorch and Intel oneCCL Bindings for PyTorch for optimal training
|
||||
performance, and can be used as a template to run your own workload on multiple nodes.
|
||||
```bash
|
||||
kubectl delete -f pytorchjob.yaml -n ${NAMESPACE}
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user