From 31be2f45a9e969c2f3f47e9e7ee9070b9ccbbf42 Mon Sep 17 00:00:00 2001
From: Stas Bekman <stas00@users.noreply.github.com>
Date: Fri, 4 Feb 2022 11:15:13 -0800
Subject: [PATCH] [deepspeed docs] Megatron-Deepspeed info (#15488)

---
 docs/source/parallelism.mdx | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/source/parallelism.mdx b/docs/source/parallelism.mdx
index 61af187136..c832c740ff 100644
--- a/docs/source/parallelism.mdx
+++ b/docs/source/parallelism.mdx
@@ -308,9 +308,14 @@ ZeRO stage 3 is not a good choice either for the same reason - more inter-node c
 And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1 optimizer states can be offloaded to CPU.
 
 Implementations:
-- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed)
+- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-DeepSpeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is a fork of the former repo.
 - [OSLO](https://github.com/tunib-ai/oslo)
 
+Important papers:
+
+- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model](
+https://arxiv.org/abs/2201.11990)
+
 🤗 Transformers status: not yet implemented, since we have no PP and TP.
 
 ## FlexFlow