From 31be2f45a9e969c2f3f47e9e7ee9070b9ccbbf42 Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Fri, 4 Feb 2022 11:15:13 -0800
Subject: [PATCH] [deepspeed docs] Megatron-Deepspeed info (#15488)

---
 docs/source/parallelism.mdx | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/source/parallelism.mdx b/docs/source/parallelism.mdx
index 61af187136..c832c740ff 100644
--- a/docs/source/parallelism.mdx
+++ b/docs/source/parallelism.mdx
@@ -308,9 +308,14 @@ ZeRO stage 3 is not a good choice either for the same reason - more inter-node c
 And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1 optimizer states can be offloaded to CPU.
 
 Implementations:
-- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed)
+- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-Deepspeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is the fork of the former repo.
 - [OSLO](https://github.com/tunib-ai/oslo)
 
+Important papers:
+
+- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model](
+https://arxiv.org/abs/2201.11990)
+
 🤗 Transformers status: not yet implemented, since we have no PP and TP.
 
 ## FlexFlow