[Doc] add more MBart and other doc (#6490)

* add mbart example
* add Pegasus and MBart in readme
* typo
* add MBart in Pretrained models
* add pre-proc doc
* add DPR in readme
* fix indent
* doc fix
2020-08-17 22:00:26 +05:30
parent f68c873100
commit c9564f5343
5 changed files with 64 additions and 7 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -126,7 +126,7 @@ conversion utilities for the following models:
    Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
 23. `Pegasus <https://github.com/google-research/pegasus>`_ (from Google) released with the paper `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
    <https://arxiv.org/abs/1912.08777>`_ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
-24. `MBart <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`_ (from Facebook) released with the paper  `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
+24. `MBart <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`_ (from Facebook) released with the paper  `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov,
    Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.  
 25. `Other community models <https://huggingface.co/models>`_, contributed by the `community
    <https://huggingface.co/users>`_.
--- a/docs/source/model_doc/mbart.rst
+++ b/docs/source/model_doc/mbart.rst
@@ -14,6 +14,45 @@ MBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scal
 The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__


+Training
+~~~~~~~~~~~~~~~~~~~~~
+MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks.
+Because the model is multilingual, it expects sequences in a different format: a special language id token
+is added to both the source and the target text. The source text format is ``X [eos, src_lang_code]``,
+where ``X`` is the source text. The target text format is ``[tgt_lang_code] X [eos]``. ``bos`` is never used.
+``MBartTokenizer.prepare_seq2seq_batch`` handles this automatically and should be used to encode
+the sequences for seq-to-seq fine-tuning.
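
To make the two layouts concrete, here is a minimal sketch in plain Python. It does not call the tokenizer: the helper names `format_source` and `format_target` are hypothetical, whole words stand in for subword ids, and the `</s>` string is only a placeholder for the eos token.

```python
from typing import List

EOS = "</s>"  # placeholder end-of-sequence marker, an assumption of this sketch


def format_source(tokens: List[str], src_lang_code: str) -> List[str]:
    """Source layout: X [eos, src_lang_code]."""
    return tokens + [EOS, src_lang_code]


def format_target(tokens: List[str], tgt_lang_code: str) -> List[str]:
    """Target layout: [tgt_lang_code] X [eos]; no bos token is added."""
    return [tgt_lang_code] + tokens + [EOS]


# Hypothetical word-level "tokens"; the real tokenizer works on subword ids.
src = format_source(["UN", "Chief", "Says"], "en_XX")
tgt = format_target(["Şeful", "ONU", "declară"], "ro_RO")
print(src)
print(tgt)
```

The language code closes the source sequence but opens the target sequence, which is why the target language id can serve as the decoder start token at generation time.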
+
+- Supervised training
+
+::
+
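+    from transformers import MBartForConditionalGeneration, MBartTokenizer
+    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
+    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
+    example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
+    expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
+    batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
+    input_ids = batch["input_ids"]
+    target_ids = batch["decoder_input_ids"]
+    # Shift the target by one position: the decoder reads every token but the last and predicts every token but the first
+    decoder_input_ids = target_ids[:, :-1].contiguous()
+    labels = target_ids[:, 1:].clone()
+    model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)  # forward pass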
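
The two slicing steps in the example above can be illustrated with plain Python lists. This is a sketch only: `shift_for_teacher_forcing` is a hypothetical helper, and the ids are made up (9 stands for the target language code, 2 for eos), not real vocabulary ids.

```python
from typing import List, Tuple


def shift_for_teacher_forcing(
    target_ids: List[List[int]],
) -> Tuple[List[List[int]], List[List[int]]]:
    """Split each target row into decoder inputs and next-token labels.

    Mirrors target_ids[:, :-1] and target_ids[:, 1:] on nested lists.
    """
    decoder_input_ids = [row[:-1] for row in target_ids]  # drop the last token
    labels = [row[1:] for row in target_ids]  # drop the first token
    return decoder_input_ids, labels


# Made-up ids laid out as [tgt_lang_code] X [eos]
target_ids = [[9, 11, 12, 13, 2]]
decoder_input_ids, labels = shift_for_teacher_forcing(target_ids)
print(decoder_input_ids)
print(labels)
```

At every position the label is the token that follows the corresponding decoder input, so the language code is only ever an input and eos is only ever a label.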
+
+- Generation
+
+    When generating the target text, set ``decoder_start_token_id`` to the target language id.
+    The following example shows how to translate English to Romanian using the ``facebook/mbart-large-en-ro`` model.
+
+::
+
+    from transformers import MBartForConditionalGeneration, MBartTokenizer
+    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
+    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
+    article = "UN Chief Says There Is No Military Solution in Syria"
+    batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], src_lang="en_XX")
+    translated_tokens = model.generate(**batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
+    translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
+    assert translation == "Şeful ONU declară că nu există o soluţie militară în Siria"
+
+
 MBartConfig
 ~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -331,9 +331,6 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``facebook/bart-large-cnn``                                | | 12-layer, 1024-hidden, 16-heads, 406M parameters       (same as base)                                                               |
 |                   |                                                            | | bart-large base architecture finetuned on cnn summarization task                                                                    |
-|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``facebook/mbart-large-en-ro``                             | | 12-layer, 1024-hidden, 16-heads, 880M parameters                                                                                    |
-|                   |                                                            | | bart-large architecture pretrained on cc25 multilingual data , finetuned on WMT english romanian translation.                       |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | DialoGPT          | ``DialoGPT-small``                                         | | 12-layer, 768-hidden, 12-heads, 124M parameters                                                                                     |
 |                   |                                                            | | Trained on English text: 147M conversation-like exchanges extracted from Reddit.                                                    |
@@ -361,3 +358,9 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   | ``allenai/longformer-large-4096``                          | | 24-layer, 1024-hidden, 16-heads, ~435M parameters                                                                                   |
 |                   |                                                            | | Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096                                                    |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| MBart             | ``facebook/mbart-large-cc25``                              | | 24-layer, 1024-hidden, 16-heads, 610M parameters                                                                                    |
+|                   |                                                            | | mBART (bart-large architecture) model trained on 25 languages' monolingual corpus                                                   |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``facebook/mbart-large-en-ro``                             | | 24-layer, 1024-hidden, 16-heads, 610M parameters                                                                                    |
+|                   |                                                            | | mbart-large-cc25 model finetuned on WMT English Romanian translation.                                                               |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+