Fix parameter count in readme for mixtral 45b (#27945)
fix parameter count in readme
@@ -30,7 +30,7 @@ Tips:


 - The model needs to be converted using the [conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mixtral/convert_mixtral_weights_to_hf.py).
-- If the model is quantized to 4 bits, a single A100 is enough to fit the entire 84B model.
+- If the model is quantized to 4 bits, a single A100 is enough to fit the entire 45B model.

 This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArthurZ).
 The original code can be found [here](https://github.com/mistralai/mistral-src).
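
Note on the 4-bit claim in this hunk: a rough back-of-the-envelope check, not part of the commit. It assumes a weights-only estimate (parameters times bits per parameter) and ignores activations, the KV cache, and quantization overhead such as scales and zero points. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter count taken from the corrected README line (45B).
NUM_PARAMS = 45e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights come to about 22.5 GB, which is consistent with fitting on a single A100 (sold in 40 GB and 80 GB variants), while 16-bit weights (about 90 GB) would not fit on one card.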
@@ -38,9 +38,9 @@ The original code can be found [here](https://github.com/mistralai/mistral-src).

 ### Model Details

-Mixtral-84B is a decoder-based LM with the following architectural choices:
+Mixtral-45B is a decoder-based LM with the following architectural choices:

-* Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of 85B parameters, but the compute required is the same as for a 14B model. This is because, even though every expert has to be loaded in RAM (a 70B-like RAM requirement), each token from the hidden states is dispatched twice (top-2 routing), and thus the compute (the operations required at each forward computation) is just 2 x sequence_length.
+* Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of 45B parameters, but the compute required is the same as for a 14B model. This is because, even though every expert has to be loaded in RAM (a 70B-like RAM requirement), each token from the hidden states is dispatched twice (top-2 routing), and thus the compute (the operations required at each forward computation) is just 2 x sequence_length.

 The following implementation details are shared with Mistral AI's first model [mistral](~models/doc/mistral):
 * Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
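
Note on the top-2 routing described in this hunk: the sketch below is illustrative only and is not the Mixtral implementation. It assumes scalar "tokens", toy experts that simply scale their input, and a softmax taken over the two selected router logits; the names `top2_route`, `moe_layer`, and `router` are hypothetical. It shows why all 8 experts must be resident in memory while only 2 expert evaluations are performed per token.

```python
import math
import random

NUM_EXPERTS = 8  # experts per MLP, as stated in the README text
TOP_K = 2        # top-2 routing, as stated in the README text


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the TOP_K highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:TOP_K]
    # Softmax restricted to the selected experts, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(tokens, experts, router):
    """Apply a sparse MoE layer to scalar tokens; returns (outputs, expert_calls)."""
    outputs, calls = [], 0
    for x in tokens:
        y = 0.0
        for idx, weight in top2_route(router(x)):
            y += weight * experts[idx](x)  # only the selected experts run
            calls += 1
        outputs.append(y)
    return outputs, calls


# Toy stand-ins (assumptions): expert k multiplies its input by k + 1.
experts = [lambda x, scale=scale: scale * x for scale in range(1, NUM_EXPERTS + 1)]
rng = random.Random(0)
router_weights = [rng.uniform(-1.0, 1.0) for _ in range(NUM_EXPERTS)]


def router(x):
    """Toy linear router: one logit per expert."""
    return [w * x for w in router_weights]


tokens = [0.5, -1.0, 2.0, 3.5]
outputs, calls = moe_layer(tokens, experts, router)
print(f"expert calls: {calls} (= {TOP_K} x {len(tokens)} tokens)")
```

The call count grows as 2 x sequence_length regardless of how many experts exist, which is the point the README sentence makes about compute versus memory.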