From 36fc401621297d2f917a0b4df4a0718067966c12 Mon Sep 17 00:00:00 2001
From: Hyunwoong Ko
Date: Wed, 6 Oct 2021 09:42:12 +0900
Subject: [PATCH] Update parallelism.md (#13892)

* Update parallelism.md

* Update docs/source/parallelism.md

Co-authored-by: Stas Bekman

* Update docs/source/parallelism.md

Co-authored-by: Stas Bekman

* Update docs/source/parallelism.md

Co-authored-by: Stas Bekman

* Update docs/source/parallelism.md

Co-authored-by: Stas Bekman

* Update docs/source/parallelism.md

Co-authored-by: Stas Bekman

* Update docs/source/parallelism.md

Co-authored-by: Stas Bekman

Co-authored-by: Stas Bekman
---
 docs/source/parallelism.md | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/docs/source/parallelism.md b/docs/source/parallelism.md
index 0d54a97bbf..dd87b8750d 100644
--- a/docs/source/parallelism.md
+++ b/docs/source/parallelism.md
@@ -296,12 +296,27 @@ Paper: ["Beyond Data and Model Parallelism for Deep Neural Networks" by Zhihao J
 
 It performs a sort of 4D Parallelism over Sample-Operator-Attribute-Parameter.
 
-1. Sample = Data Parallelism
-2. Operator = part vertical Layer Parallelism, but it can split the layer too - more refined level
-3. Attribute = horizontal Model Parallelism (Megatron-LM style)
-4. Parameter = Sharded model params
+1. Sample = Data Parallelism (sample-wise parallel)
+2. Operator = Parallelize a single operation into several sub-operations
+3. Attribute = Data Parallelism (length-wise parallel)
+4. Parameter = Model Parallelism (regardless of dimension - horizontal or vertical)
 
-and they are working on Pipeline Parallelism. I guess ZeRO-DP is Sample+Parameter in this context.
+Examples:
+* Sample
+
+Let's take 10 batches of sequence length 512. If we parallelize them by sample dimension into 2 devices, we get 10 x 512 which becomes 5 x 2 x 512.
+
+* Operator
+
+If we perform layer normalization, we compute std first and mean second, and then we can normalize data. Operator parallelism allows computing std and mean in parallel. So if we parallelize them by operator dimension into 2 devices (cuda:0, cuda:1), first we copy input data into both devices, and cuda:0 computes std, cuda:1 computes mean at the same time.
+
+* Attribute
+
+We have 10 batches of 512 length. If we parallelize them by attribute dimension into 2 devices, 10 x 512 will be 10 x 2 x 256.
+
+* Parameter
+
+It is similar to tensor model parallelism or naive layer-wise model parallelism.
 
 ![flex-flow-soap](imgs/parallelism-flexflow.jpeg)
 
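
Note on the Sample, Attribute, and Operator examples added in this patch: the sketch below mirrors their shapes in plain Python. It is only an illustration under stated assumptions, not FlexFlow code. The function names, the `NUM_DEVICES` constant, the placeholder data, and the use of worker threads in place of `cuda:0` and `cuda:1` are all hypothetical, and the normalization is a simplified stand-in for layer normalization.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean, pstdev

NUM_DEVICES = 2  # hypothetical device count, matching the examples in the patch


def split_by_sample(batch, num_devices=NUM_DEVICES):
    """Sample dimension: each device receives a contiguous group of rows."""
    per_device = len(batch) // num_devices
    return [batch[d * per_device:(d + 1) * per_device] for d in range(num_devices)]


def split_by_attribute(batch, num_devices=NUM_DEVICES):
    """Attribute dimension: each device receives a slice of every row."""
    per_device = len(batch[0]) // num_devices
    return [[row[d * per_device:(d + 1) * per_device] for row in batch]
            for d in range(num_devices)]


def normalize_operator_parallel(row, eps=1e-5):
    """Operator dimension: std and mean run as two independent tasks.

    Two worker threads stand in for cuda:0 and cuda:1; both read the same input.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        std_future = pool.submit(pstdev, row)   # "cuda:0" computes std
        mean_future = pool.submit(fmean, row)   # "cuda:1" computes mean
        std, mean = std_future.result(), mean_future.result()
    return [(value - mean) / (std + eps) for value in row]


# 10 samples of sequence length 512, filled with placeholder values.
batch = [[float(i * 512 + j) for j in range(512)] for i in range(10)]

sample_shards = split_by_sample(batch)
attribute_shards = split_by_attribute(batch)
normalized_row = normalize_operator_parallel(batch[0])

print([(len(s), len(s[0])) for s in sample_shards])     # [(5, 512), (5, 512)]
print([(len(s), len(s[0])) for s in attribute_shards])  # [(10, 256), (10, 256)]
```

Each of the two sample shards holds 5 x 512 values and each attribute shard holds 10 x 256, which corresponds to the 5 x 2 x 512 and 10 x 2 x 256 shapes quoted in the examples.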