Add deepspeed test to amd scheduled CI (#27633)
* add deepspeed scheduled test for amd * fix image * add dockerfile * add comment * enable tests * trigger * remove trigger for this branch * trigger * change runner env to trigger the docker build image test * use new docker image * remove test suffix from docker image tag * replace test docker image with original image * push new image * Trigger * add back amd tests * fix typo * add amd tests back * fix * comment until docker image build scheduled test fix * remove deprecated deepspeed build option * upgrade torch * update docker & make tests pass * Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile * fix * tmp disable test * precompile deepspeed to avoid timeout during tests * fix comment * trigger deepspeed tests with new image * comment tests * trigger * add sklearn dependency to fix slow tests * enable back other tests * final update --------- Co-authored-by: Felix Marty <felix@hf.co> Co-authored-by: Félix Marty <9808326+fxmarty@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
This commit is contained in:
@@ -22,7 +22,11 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools ninja git+htt
|
||||
|
||||
ARG REF=main
|
||||
WORKDIR /
|
||||
|
||||
# Invalidate docker cache from here if new commit is available.
|
||||
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
|
||||
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
|
||||
|
||||
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing,video]
|
||||
|
||||
RUN python3 -m pip uninstall -y tensorflow flax
|
||||
|
||||
45
docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile
Normal file
45
docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile
Normal file
@@ -0,0 +1,45 @@
|
||||
FROM rocm/dev-ubuntu-22.04:5.6
|
||||
LABEL maintainer="Hugging Face"
|
||||
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG PYTORCH='2.1.1'
|
||||
ARG TORCH_VISION='0.16.1'
|
||||
ARG TORCH_AUDIO='2.1.1'
|
||||
ARG ROCM='5.6'
|
||||
|
||||
RUN apt update && \
|
||||
apt install -y --no-install-recommends \
|
||||
libaio-dev \
|
||||
git \
|
||||
# These are required to build deepspeed.
|
||||
python3-dev \
|
||||
python-is-python3 \
|
||||
rocrand-dev \
|
||||
rocthrust-dev \
|
||||
hipsparse-dev \
|
||||
hipblas-dev \
|
||||
rocblas-dev && \
|
||||
apt clean && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
RUN python3 -m pip install --no-cache-dir --upgrade pip ninja "pydantic<2"
|
||||
RUN python3 -m pip uninstall -y apex torch torchvision torchaudio
|
||||
RUN python3 -m pip install torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO --index-url https://download.pytorch.org/whl/rocm$ROCM --no-cache-dir
|
||||
|
||||
# Pre-build DeepSpeed, so it's be ready for testing (to avoid timeout)
|
||||
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache-dir -v --disable-pip-version-check 2>&1
|
||||
|
||||
ARG REF=main
|
||||
WORKDIR /
|
||||
|
||||
# Invalidate docker cache from here if new commit is available.
|
||||
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json
|
||||
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
|
||||
|
||||
RUN python3 -m pip install --no-cache-dir ./transformers[accelerate,testing,sentencepiece,sklearn]
|
||||
|
||||
# When installing in editable mode, `transformers` is not recognized as a package.
|
||||
# this line must be added in order for python to be aware of transformers.
|
||||
RUN cd transformers && python3 setup.py develop
|
||||
|
||||
RUN python3 -c "from deepspeed.launcher.runner import main"
|
||||
@@ -34,7 +34,7 @@ RUN python3 -m pip uninstall -y torch-tensorrt
|
||||
|
||||
# recompile apex
|
||||
RUN python3 -m pip uninstall -y apex
|
||||
RUN git clone https://github.com/NVIDIA/apex
|
||||
# RUN git clone https://github.com/NVIDIA/apex
|
||||
# `MAX_JOBS=1` disables parallel building to avoid cpu memory OOM when building image on GitHub Action (standard) runners
|
||||
# TODO: check if there is alternative way to install latest apex
|
||||
# RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
|
||||
@@ -44,7 +44,7 @@ RUN python3 -m pip uninstall -y deepspeed
|
||||
# This has to be run (again) inside the GPU VMs running the tests.
|
||||
# The installation works here, but some tests fail, if we don't pre-build deepspeed again in the VMs running the tests.
|
||||
# TODO: Find out why test fail.
|
||||
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
|
||||
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
|
||||
|
||||
# When installing in editable mode, `transformers` is not recognized as a package.
|
||||
# this line must be added in order for python to be aware of transformers.
|
||||
|
||||
Reference in New Issue
Block a user