From 3b74889e8f3109efb36a0002c8aaa3ede164e30e Mon Sep 17 00:00:00 2001
From: Victor Geislinger <9027783+MrGeislinger@users.noreply.github.com>
Date: Thu, 4 May 2023 06:56:45 -0700
Subject: [PATCH] Remove typo in perf_train_gpu_many.mdx (#23144)

- Excess `w` in the word `bottom`
---
 docs/source/en/perf_train_gpu_many.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/perf_train_gpu_many.mdx b/docs/source/en/perf_train_gpu_many.mdx
index 17eb7b7399..e756732daf 100644
--- a/docs/source/en/perf_train_gpu_many.mdx
+++ b/docs/source/en/perf_train_gpu_many.mdx
@@ -272,7 +272,7 @@ It's easy to see from the bottom diagram how PP has less dead zones, where GPUs
 
 Both parts of the diagram show a parallelism that is of degree 4. That is 4 GPUs are participating in the pipeline. So there is the forward path of 4 pipe stages F0, F1, F2 and F3 and then the return reverse order backward path of B3, B2, B1 and B0.
 
-PP introduces a new hyper-parameter to tune and it's `chunks` which defines how many chunks of data are sent in a sequence through the same pipe stage. For example, in the bottomw diagram you can see that `chunks=4`. GPU0 performs the same forward path on chunk 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then it waits for other GPUs to do their work and only when their work is starting to be complete, GPU0 starts to work again doing the backward path for chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0).
+PP introduces a new hyper-parameter to tune and it's `chunks` which defines how many chunks of data are sent in a sequence through the same pipe stage. For example, in the bottom diagram you can see that `chunks=4`. GPU0 performs the same forward path on chunk 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then it waits for other GPUs to do their work and only when their work is starting to be complete, GPU0 starts to work again doing the backward path for chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0).
 
 Note that conceptually this is the same concept as gradient accumulation steps (GAS). Pytorch uses `chunks`, whereas DeepSpeed refers to the same hyper-parameter as GAS.