From 38611086d293ea4a5809bcd7fadd8081d55cb74e Mon Sep 17 00:00:00 2001
From: Aaron Jimenez
Date: Tue, 19 Dec 2023 10:34:14 -0800
Subject: [PATCH] [docs] Fix mistral link in mixtral.md (#28143)

Fix mistral link in mixtral.md
---
 docs/source/en/model_doc/mixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/mixtral.md b/docs/source/en/model_doc/mixtral.md
index 719b3bb9ea..01f18bcd3c 100644
--- a/docs/source/en/model_doc/mixtral.md
+++ b/docs/source/en/model_doc/mixtral.md
@@ -42,7 +42,7 @@ Mixtral-45B is a decoder-based LM with the following architectural choices:
 
 * Mixtral is a Mixture of Expert (MOE) model with 8 experts per MLP, with a total of 45B paramateres but the compute required is the same as a 14B model. This is because even though each experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dipatched twice (top 2 routing) and thus the compute (the operation required at each foward computation) is just 2 X sequence_length. 
 
-The following implementation details are shared with Mistral AI's first model [mistral](~models/doc/mistral):
+The following implementation details are shared with Mistral AI's first model [mistral](mistral):
 * Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
 * GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
 * Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens.