[Docs] Model_doc structure/clarity improvements (#26876)
* first batch of structure improvements for model_docs
* second batch of structure improvements for model_docs
* more structure improvements for model_docs
* more structure improvements for model_docs
* structure improvements for cv model_docs
* more structural refactoring
* addressed feedback about image processors
@@ -37,22 +37,24 @@ improvements to counteract overfitting while training on thousands of tasks. Cri
a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.*

Tips:
This model was contributed by [Arthur Zucker](https://huggingface.co/ArtZucker).
The original code can be found [here](https://github.com/facebookresearch/fairseq).
## Usage tips

- `M2M100ForConditionalGeneration` is the base model for both NLLB and NLLB-MoE.
- NLLB-MoE is very similar to the NLLB model, but its feed-forward layer is based on the implementation of SwitchTransformers.
- The tokenizer is the same as the one used by the NLLB models.

This model was contributed by [Arthur Zucker](https://huggingface.co/ArtZucker).
The original code can be found [here](https://github.com/facebookresearch/fairseq).
## Implementation differences with SwitchTransformers

The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate`, which means that for each input only the two experts with the
highest predicted probabilities from the gating network are selected, and the remaining experts are ignored. In `SwitchTransformers`, only the top-1 probabilities are computed,
which means that tokens are less likely to be forwarded. Moreover, if a token is not routed to any expert, `SwitchTransformers` still adds its unmodified hidden
states (much like a residual connection), while they are masked in `NLLB`'s top-2 routing mechanism.
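The routing difference described above can be illustrated with a small, framework-free sketch. This is not the code used by either model: `top_k_experts` and `expert_layer_output` are hypothetical helper names, the router logits are made up, and representing a "masked" token as a zero vector is an assumption made for illustration.

```python
import math


def softmax(logits):
    """Turn raw router logits into a probability distribution over experts."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_experts(router_logits, k):
    """Return the indices of the k most probable experts and their probabilities."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    return chosen, [probs[i] for i in chosen]


def expert_layer_output(hidden, expert_output, was_routed, passthrough_unrouted):
    """Pick what the MoE layer emits for one token.

    passthrough_unrouted=True mimics the behaviour described for SwitchTransformers
    (the unmodified hidden state is kept); False mimics the masked behaviour
    described for NLLB-MoE (assumed here to be a zero vector).
    """
    if was_routed:
        return expert_output
    return list(hidden) if passthrough_unrouted else [0.0] * len(hidden)


# Made-up router logits for a single token and four experts.
router_logits = [0.1, 2.0, -1.0, 1.5]

top2, _ = top_k_experts(router_logits, k=2)  # top-2 gate, as in NLLB-MoE
top1, _ = top_k_experts(router_logits, k=1)  # top-1 gate, as in SwitchTransformers
print(top2)  # [1, 3]
print(top1)  # [1]
```

With top-2 gating a token has two chances to land on an expert that still has capacity, which is one way to read the statement that top-1 routing forwards tokens less often.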
## Generating with NLLB-MoE

The available checkpoints require around 350GB of storage. Make sure to use `accelerate` if you do not have enough RAM on your machine.

While generating the target text, set the `forced_bos_token_id` to the target language id. The following
@@ -99,7 +101,7 @@ See example below for a translation from romanian to german:
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
```
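As a minimal sketch of the generation flow described in this section, the snippet below loads the model with `device_map="auto"` (which relies on `accelerate`) and forces the first generated token to be the target language code. The checkpoint name `facebook/nllb-moe-54b`, the language codes `ron_Latn` and `deu_Latn`, and the helper names `build_generation_kwargs` and `translate` are assumptions that do not appear in the text shown here; treat it as an illustration rather than the documented example.

```python
def build_generation_kwargs(tokenizer, target_lang, max_length=50):
    """Collect generate() arguments that force the target language as first token."""
    return {
        # The language code is an ordinary token in the tokenizer vocabulary.
        "forced_bos_token_id": tokenizer.convert_tokens_to_ids(target_lang),
        "max_length": max_length,
    }


def translate(text, checkpoint="facebook/nllb-moe-54b",
              src_lang="ron_Latn", target_lang="deu_Latn"):
    """Translate `text`; the checkpoint and language codes are assumed defaults."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang=src_lang)
    # device_map="auto" requires accelerate and places the weights across
    # the devices that are available instead of loading everything at once.
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    tokens = model.generate(**inputs, **build_generation_kwargs(tokenizer, target_lang))
    return tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]
```

Calling `translate(...)` downloads the full checkpoint, so it is only practical on a machine with enough storage and memory for a model of this size.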
## Documentation resources
## Resources

- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)