Update the distributed CPU training on Kubernetes documentation (#32669)

* Update the Kubernetes CPU training example

* Add namespace arg

Signed-off-by: Dina Suehiro Jones <dina.s.jones@intel.com>

---------

Signed-off-by: Dina Suehiro Jones <dina.s.jones@intel.com>
Dina Suehiro Jones
2024-08-14 09:36:43 -07:00
committed by GitHub
parent 20a04497a8
commit 6577c77d93

@@ -155,13 +155,20 @@ This example assumes that you have:
 The snippet below is an example of a Dockerfile that uses a base image that supports distributed CPU training and then
 extracts a Transformers release to the `/workspace` directory, so that the example scripts are included in the image:
 ```dockerfile
-FROM intel/ai-workflows:torch-2.0.1-huggingface-multinode-py3.9
+FROM intel/intel-optimized-pytorch:2.3.0-pip-multinode
+RUN apt-get update -y && \
+    apt-get install -y --no-install-recommends --fix-missing \
+    google-perftools \
+    libomp-dev
 WORKDIR /workspace
 # Download and extract the transformers code
-ARG HF_TRANSFORMERS_VER="4.35.2"
-RUN mkdir transformers && \
+ARG HF_TRANSFORMERS_VER="4.44.0"
+RUN pip install --no-cache-dir \
+    transformers==${HF_TRANSFORMERS_VER} && \
+    mkdir transformers && \
     curl -sSL --retry 5 https://github.com/huggingface/transformers/archive/refs/tags/v${HF_TRANSFORMERS_VER}.tar.gz | tar -C transformers --strip-components=1 -xzf -
 ```
 The image needs to be built and copied to the cluster's nodes or pushed to a container registry prior to deploying the
@@ -189,7 +196,6 @@ apiVersion: "kubeflow.org/v1"
 kind: PyTorchJob
 metadata:
   name: transformers-pytorchjob
-  namespace: kubeflow
 spec:
   elasticPolicy:
     rdzvBackend: c10d
@@ -206,32 +212,27 @@ spec:
             - name: pytorch
               image: <image name>:<tag> # Specify the docker image to use for the worker pods
               imagePullPolicy: IfNotPresent
-              command:
-                - torchrun
-                - /workspace/transformers/examples/pytorch/question-answering/run_qa.py
-                - --model_name_or_path
-                - "google-bert/bert-large-uncased"
-                - --dataset_name
-                - "squad"
-                - --do_train
-                - --do_eval
-                - --per_device_train_batch_size
-                - "12"
-                - --learning_rate
-                - "3e-5"
-                - --num_train_epochs
-                - "2"
-                - --max_seq_length
-                - "384"
-                - --doc_stride
-                - "128"
-                - --output_dir
-                - "/tmp/pvc-mount/output"
-                - --no_cuda
-                - --ddp_backend
-                - "ccl"
-                - --use_ipex
-                - --bf16 # Specify --bf16 if your hardware supports bfloat16
+              command: ["/bin/bash", "-c"]
+              args:
+                - >-
+                  cd /workspace/transformers;
+                  pip install -r /workspace/transformers/examples/pytorch/question-answering/requirements.txt;
+                  source /usr/local/lib/python3.10/dist-packages/oneccl_bindings_for_pytorch/env/setvars.sh;
+                  torchrun /workspace/transformers/examples/pytorch/question-answering/run_qa.py \
+                    --model_name_or_path distilbert/distilbert-base-uncased \
+                    --dataset_name squad \
+                    --do_train \
+                    --do_eval \
+                    --per_device_train_batch_size 12 \
+                    --learning_rate 3e-5 \
+                    --num_train_epochs 2 \
+                    --max_seq_length 384 \
+                    --doc_stride 128 \
+                    --output_dir /tmp/pvc-mount/output_$(date +%Y%m%d_%H%M%S) \
+                    --no_cuda \
+                    --ddp_backend ccl \
+                    --bf16 \
+                    --use_ipex;
               env:
               - name: LD_PRELOAD
                 value: "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.9:/usr/local/lib/libiomp5.so"
@@ -244,13 +245,13 @@ spec:
               - name: CCL_WORKER_COUNT
                 value: "1"
               - name: OMP_NUM_THREADS # Can be tuned for optimal performance
-                value: "56"
+                value: "240"
               resources:
                 limits:
-                  cpu: 200 # Update the CPU and memory limit values based on your nodes
+                  cpu: 240 # Update the CPU and memory limit values based on your nodes
                   memory: 128Gi
                 requests:
-                  cpu: 200 # Update the CPU and memory request values based on your nodes
+                  cpu: 240 # Update the CPU and memory request values based on your nodes
                   memory: 128Gi
               volumeMounts:
               - name: pvc-volume
@@ -258,8 +259,8 @@ spec:
               - mountPath: /dev/shm
                 name: dshm
           restartPolicy: Never
-          nodeSelector: # Optionally use the node selector to specify what types of nodes to use for the workers
-            node-type: spr
+          nodeSelector: # Optionally use nodeSelector to match a certain node label for the worker pods
+            node-type: gnr
           volumes:
           - name: pvc-volume
             persistentVolumeClaim:
@@ -287,10 +288,12 @@ set the same CPU and memory amounts for both the resource limits and requests.
 After the PyTorchJob spec has been updated with values appropriate for your cluster and training job, it can be deployed
 to the cluster using:
 ```bash
-kubectl create -f pytorchjob.yaml
+export NAMESPACE=<specify your namespace>
+kubectl create -f pytorchjob.yaml -n ${NAMESPACE}
 ```
-The `kubectl get pods -n kubeflow` command can then be used to list the pods in the `kubeflow` namespace. You should see
+The `kubectl get pods -n ${NAMESPACE}` command can then be used to list the pods in your namespace. You should see
 the worker pods for the PyTorchJob that was just deployed. At first, they will probably have a status of "Pending" as
 the containers get pulled and created, then the status should change to "Running".
 ```
@@ -303,13 +306,13 @@ transformers-pytorchjob-worker-3 1/1 Running
 ...
 ```
-The logs for worker can be viewed using `kubectl logs -n kubeflow <pod name>`. Add `-f` to stream the logs, for example:
+The logs for worker can be viewed using `kubectl logs <pod name> -n ${NAMESPACE}`. Add `-f` to stream the logs, for example:
 ```bash
-kubectl logs -n kubeflow transformers-pytorchjob-worker-0 -f
+kubectl logs transformers-pytorchjob-worker-0 -n ${NAMESPACE} -f
 ```
 After the training job completes, the trained model can be copied from the PVC or storage location. When you are done
-with the job, the PyTorchJob resource can be deleted from the cluster using `kubectl delete -f pytorchjob.yaml`.
+with the job, the PyTorchJob resource can be deleted from the cluster using `kubectl delete -f pytorchjob.yaml -n ${NAMESPACE}`.
 ## Summary