From 51d732709e5ae424e8fb6c4e58b72057a3e413c2 Mon Sep 17 00:00:00 2001
From: Fanli Lin
Date: Sat, 31 May 2025 00:05:07 +0800
Subject: [PATCH] [docs] add xpu environment variable for gpu selection (#38194)

* squash commits

* rename gpu

* rename accelerator

* change _toctree.yml

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: sdp
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
---
 docs/source/en/_toctree.yml             |   4 +-
 docs/source/en/accelerator_selection.md | 126 ++++++++++++++++++++++++
 docs/source/en/gpu_selection.md         |  94 ------------------
 3 files changed, 128 insertions(+), 96 deletions(-)
 create mode 100644 docs/source/en/accelerator_selection.md
 delete mode 100644 docs/source/en/gpu_selection.md

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 0bc2750c3a..6269fc5ead 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -129,8 +129,8 @@
       title: Hyperparameter search
     title: Trainer API
   - sections:
-    - local: gpu_selection
-      title: GPU selection
+    - local: accelerator_selection
+      title: Accelerator selection
     - local: accelerate
       title: Accelerate
     - local: fsdp
diff --git a/docs/source/en/accelerator_selection.md b/docs/source/en/accelerator_selection.md
new file mode 100644
index 0000000000..5d5bbc2675
--- /dev/null
+++ b/docs/source/en/accelerator_selection.md
@@ -0,0 +1,126 @@
+
+
+# Accelerator selection
+
+During distributed training, you can specify the number and order of accelerators (CUDA, XPU, MPS, HPU, etc.) to use. This can be useful when you have accelerators with different computing power and you want to use the faster accelerator first. Or you could only use a subset of the available accelerators. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).
+
+This guide will show you how to select the number of accelerators to use and the order to use them in.
+
+## Number of accelerators
+
+For example, if there are 4 accelerators and you only want to use the first 2, run the command below.
+
+
+
+Use the `--nproc_per_node` to select how many accelerators to use.
+
+```bash
+torchrun --nproc_per_node=2 trainer-program.py ...
+```
+
+
+
+Use `--num_processes` to select how many accelerators to use.
+
+```bash
+accelerate launch --num_processes 2 trainer-program.py ...
+```
+
+
+
+Use `--num_gpus` to select how many GPUs to use.
+
+```bash
+deepspeed --num_gpus 2 trainer-program.py ...
+```
+
+
+
+## Order of accelerators
+To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
+
+For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
+
+
+
+```bash
+CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
+```
+
+Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively.
+To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
+
+
+```bash
+CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
+```
+
+To run without any GPUs:
+
+```bash
+CUDA_VISIBLE_DEVICES= python trainer-program.py ...
+```
+
+You can also control the order of CUDA devices using `CUDA_DEVICE_ORDER`:
+
+- Order by PCIe bus ID (matches `nvidia-smi`):
+
+  ```bash
+  export CUDA_DEVICE_ORDER=PCI_BUS_ID
+  ```
+
+- Order by compute capability (fastest first):
+
+  ```bash
+  export CUDA_DEVICE_ORDER=FASTEST_FIRST
+  ```
+
+
+
+```bash
+ZE_AFFINITY_MASK=0,2 torchrun trainer-program.py ...
+```
+
+Only XPUs 0 and 2 are "visible" to PyTorch and are mapped to `xpu:0` and `xpu:1` respectively.
+To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
+
+```bash
+ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
+```
+
+
+You can also control the order of Intel XPUs with:
+
+```bash
+export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
+```
+
+For more information about device enumeration and sorting on Intel XPU, please refer to the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
+
+
+
+
+
+> [!WARNING]
+> Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
diff --git a/docs/source/en/gpu_selection.md b/docs/source/en/gpu_selection.md
deleted file mode 100644
index 57623ed74a..0000000000
--- a/docs/source/en/gpu_selection.md
+++ /dev/null
@@ -1,94 +0,0 @@
-
-
-# GPU selection
-
-During distributed training, you can specify the number of GPUs to use and in what order. This can be useful when you have GPUs with different computing power and you want to use the faster GPU first. Or you could only use a subset of the available GPUs. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).
-
-This guide will show you how to select the number of GPUs to use and the order to use them in.
-
-## Number of GPUs
-
-For example, if there are 4 GPUs and you only want to use the first 2, run the command below.
-
-
-
-Use the `--nproc_per_node` to select how many GPUs to use.
-
-```bash
-torchrun --nproc_per_node=2 trainer-program.py ...
-```
-
-
-
-Use `--num_processes` to select how many GPUs to use.
-
-```bash
-accelerate launch --num_processes 2 trainer-program.py ...
-```
-
-
-
-Use `--num_gpus` to select how many GPUs to use.
-
-```bash
-deepspeed --num_gpus 2 trainer-program.py ...
-```
-
-
-
-### Order of GPUs
-
-To select specific GPUs to use and their order, configure the `CUDA_VISIBLE_DEVICES` environment variable. It is easiest to set the environment variable in `~/bashrc` or another startup config file. `CUDA_VISIBLE_DEVICES` is used to map which GPUs are used. For example, if there are 4 GPUs (0, 1, 2, 3) and you only want to run GPUs 0 and 2:
-
-```bash
-CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
-```
-
-Only the 2 physical GPUs (0 and 2) are "visible" to PyTorch and these are mapped to `cuda:0` and `cuda:1` respectively. You can also reverse the order of the GPUs to use 2 first. The mapping becomes `cuda:1` for GPU 0 and `cuda:0` for GPU 2.
-
-```bash
-CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
-```
-
-You can also set the `CUDA_VISIBLE_DEVICES` environment variable to an empty value to create an environment without GPUs.
-
-```bash
-CUDA_VISIBLE_DEVICES= python trainer-program.py ...
-```
-
-> [!WARNING]
-> As with any environment variable, they can be exported instead of being added to the command line. However, this is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong GPUs. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
-
-`CUDA_DEVICE_ORDER` is an alternative environment variable you can use to control how the GPUs are ordered. You can order according to the following.
-
-1. PCIe bus IDs that matches the order of [`nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface) and [`rocm-smi`](https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/.doxygen/docBin/html/index.html) for NVIDIA and AMD GPUs respectively.
-
-```bash
-export CUDA_DEVICE_ORDER=PCI_BUS_ID
-```
-
-2. GPU compute ability.
-
-```bash
-export CUDA_DEVICE_ORDER=FASTEST_FIRST
-```
-
-The `CUDA_DEVICE_ORDER` is especially useful if your training setup consists of an older and newer GPU, where the older GPU appears first, but you cannot physically swap the cards to make the newer GPU appear first. In this case, set `CUDA_DEVICE_ORDER=FASTEST_FIRST` to always use the newer and faster GPU first (`nvidia-smi` or `rocm-smi` still reports the GPUs in their PCIe order). Or you could also set `export CUDA_VISIBLE_DEVICES=1,0`.
\ No newline at end of file