Cleanup documentation for BART, Marian, MBART and Pegasus (#7523)

* Cleanup documentation for BART, Marian, MBART and Pegasus
2020-10-05 04:22:12 -04:00
parent 5e941bece2
commit e2c935f561
16 changed files with 594 additions and 284 deletions
--- a/docs/source/model_doc/marian.rst
+++ b/docs/source/model_doc/marian.rst
@@ -1,36 +1,51 @@
 MarianMT
 -----------------------------------------------------------------------------------------------------------------------
-**Bugs:** If you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__ and assign
-@sshleifer. Translations should be similar, but not identical to, output in the test set linked to in each model card.
+
+**Bugs:** If you see something strange, file a `GitHub Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
+and assign @sshleifer. 
+
+Translations should be similar, but not identical to, output in the test set linked to in each model card.

 Implementation Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Each model is about 298 MB on disk, there are 1,000+ models.
+
+- Each model is about 298 MB on disk, and there are more than 1,000 models.
 - The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- models were originally trained by `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian <https://marian-nmt.github.io/>`_ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented in a model card.
+- Models were originally trained by 
+  `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the
+  `Marian <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
+- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
+  in a model card.
 - The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as ``BartForConditionalGeneration`` with a few minor modifications:
-    - static (sinusoid) positional embeddings (``MarianConfig.static_position_embeddings=True``)
-    - a new final_logits_bias (``MarianConfig.add_bias_logits=True``)
-    - no layernorm_embedding (``MarianConfig.normalize_embedding=False``)
-    - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix. (Bart uses <s/>)
- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``
+- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
+    - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
+    - a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
+    - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
+    - the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
+      :obj:`<s>`).
+- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``.

 Naming
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- All  model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here <https://developers.google.com/admin-sdk/directory/v1/languages>`_, three digit codes require googling "language code {code}".
- Codes formatted like ``es_AR`` are usually ``code_{region}``. That one is spanish documents from Argentina.
+
+- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`.
+- The language codes used to name models are inconsistent. Two-letter codes can usually be found `here
+  <https://developers.google.com/admin-sdk/directory/v1/languages>`__; three-letter codes require searching for
+  "language code {code}".
+- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
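
The naming format above can be sketched in plain Python. This is only an illustration: ``build_model_name`` and ``split_model_name`` are hypothetical helpers, not part of ``transformers``, and the sketch assumes that neither ``src`` nor ``tgt`` contains a hyphen (multi-language collections use ``_`` as the separator).

```python
from typing import Tuple

PREFIX = "opus-mt-"  # fixed part of every model name in the format above


def build_model_name(src: str, tgt: str, org: str = "Helsinki-NLP") -> str:
    """Assemble a model name of the form {org}/opus-mt-{src}-{tgt}."""
    return f"{org}/{PREFIX}{src}-{tgt}"


def split_model_name(model_name: str) -> Tuple[str, str]:
    """Return (src, tgt) for a name in the format above.

    Assumption: src and tgt contain no hyphen, so the last hyphen
    separates them.
    """
    _, _, suffix = model_name.partition("/")
    if not suffix.startswith(PREFIX):
        raise ValueError(f"unexpected model name: {model_name!r}")
    src, sep, tgt = suffix[len(PREFIX):].rpartition("-")
    if not (sep and src and tgt):
        raise ValueError(f"cannot find src and tgt in: {model_name!r}")
    return src, tgt
```

For instance, ``split_model_name("Helsinki-NLP/opus-mt-en-ROMANCE")`` yields ``("en", "ROMANCE")`` under these assumptions.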


 Multilingual Models
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-All  model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``:
-    - if ``src`` is in all caps, the model supports multiple input languages, you can figure out which ones by looking at the model card, or the Group Members `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
-    - if ``tgt`` is in all caps, the model can output multiple languages, and you should specify a language code by prepending the desired output language to the src_text
+All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
+
+    - If :obj:`src` is in all caps, the model supports multiple input languages. You can figure out which ones by
+      looking at the model card, or the Group Members `mapping
+      <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_.
+    - If :obj:`tgt` is in all caps, the model can output multiple languages, and you should specify a language code by
+      prepending the desired output language to the :obj:`src_text`.
    - You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``

 Example of translating english to many romance languages, using language codes:
@@ -54,12 +69,20 @@ Example of translating english to many romance languages, using language codes:
    # 'Isto deve ir para o português.',
    # 'Y esto al español']

-Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a separator for src or tgt, as in ``'Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi'``. These still require language codes.
-There are many supported regional language codes, like ``>>es_ES<<`` (Spain) and ``>>es_AR<<`` (Argentina), that do not seem to change translations. I have not found these to provide different results than just using ``>>es<<``.
+Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a
+separator for src or tgt, as in :obj:`Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi`. These still require language
+codes.

-For Example:
-    - ``Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU``: translates from all NORTH_EU languages (see `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special language code like ``>>de<<`` to specify output language.
-    - ``Helsinki-NLP/opus-mt-ROMANCE-en``: translates from many romance languages to english, no codes needed since there is only 1 tgt language.
+There are many supported regional language codes, like :obj:`>>es_ES<<` (Spain) and :obj:`>>es_AR<<` (Argentina), that
+do not seem to change translations. I have not found these to provide different results than just using :obj:`>>es<<`.
+
+For example:
+
+    - :obj:`Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU`: translates from all NORTH_EU languages (see `mapping
+      <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special
+      language code like :obj:`>>de<<` to specify output language.
+    - :obj:`Helsinki-NLP/opus-mt-ROMANCE-en`: translates from many Romance languages to English. No codes are needed
+      since there is only one target language.
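
The rules above can be approximated with a small heuristic. This is a sketch under stated assumptions: ``needs_language_code`` and ``add_language_code`` are hypothetical helpers, not part of ``transformers``. The check treats a target as multilingual when it contains upper-case letters or ``_``, and the single space between the code and the text is an assumption about the expected input format.

```python
def needs_language_code(model_name: str) -> bool:
    """Heuristic: does the target side cover more than one language?

    Assumes the name ends in -{tgt} and that tgt contains no hyphen.
    Upper-case targets (e.g. ROMANCE) are groups; targets joined by "_"
    (e.g. en_el_es_fi) are collections that do not resolve to a group.
    """
    tgt = model_name.rsplit("-", 1)[-1]
    return tgt != tgt.lower() or "_" in tgt


def add_language_code(src_text: str, lang: str) -> str:
    """Prepend a >>lang<< code to the source text.

    The separating space is an assumption, not something stated above.
    """
    return f">>{lang}<< {src_text}"


def prepare_texts(model_name, src_texts, lang=None):
    """Prefix every text with the language code when the model needs one."""
    if not needs_language_code(model_name):
        return list(src_texts)
    if lang is None:
        raise ValueError("this model needs a target language code")
    return [add_language_code(text, lang) for text in src_texts]
```

With this sketch, texts for ``Helsinki-NLP/opus-mt-ROMANCE-en`` pass through unchanged, while texts for ``Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU`` get a prefix such as ``>>de<<``. The valid codes for a given checkpoint are still the ones listed in ``tokenizer.supported_language_codes``.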



@@ -86,13 +109,6 @@ Code to see available pretrained models:
    suffix = [x.split('/')[1] for x in model_ids]
    multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]

-MarianMTModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
-Model API is identical to BartForConditionalGeneration.
-Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__
-This class inherits nearly all functionality from ``BartForConditionalGeneration``, see that page for method signatures.

 MarianConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -107,5 +123,7 @@ MarianTokenizer
    :members: prepare_seq2seq_batch


+MarianMTModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-
+.. autoclass:: transformers.MarianMTModel