[deepspeed docs] Megatron-Deepspeed info (#15488)

2022-02-04 11:15:13 -08:00
parent bbe9c6981b
commit 31be2f45a9
1 changed files with 6 additions and 1 deletions
--- a/docs/source/parallelism.mdx
+++ b/docs/source/parallelism.mdx
@@ -308,9 +308,14 @@ ZeRO stage 3 is not a good choice either for the same reason - more inter-node c
 And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1, optimizer states can be offloaded to CPU.
 Implementations:
- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed)
+- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-DeepSpeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is a fork of the former repo.
 - [OSLO](https://github.com/tunib-ai/oslo)
 Important papers:
 - [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model](
 https://arxiv.org/abs/2201.11990)
 🤗 Transformers status: not yet implemented, since we have no PP and TP.
 ## FlexFlow