From 96f4828ace2b50a27813d35d67ead5a51e97236d Mon Sep 17 00:00:00 2001
From: Lysandre Debut
Date: Tue, 20 Oct 2020 17:02:47 +0200
Subject: [PATCH] Respect the 119 line chars (#7928)

---
 docs/source/model_summary.rst | 61 +++++++++++++++++++++++------------
 1 file changed, 40 insertions(+), 21 deletions(-)

diff --git a/docs/source/model_summary.rst b/docs/source/model_summary.rst
index 40524c97b5..3df92455e7 100644
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -500,8 +500,8 @@ BART
 `_, Mike Lewis et al.
 
 Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
-fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder, on the
-pretraining tasks, a composition of the following transformations are applied:
+fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder
+, on the pretraining tasks, a composition of the following transformations are applied:
 
 * mask random tokens (like in BERT)
 * delete random tokens
@@ -526,12 +526,17 @@ Pegasus
 `PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization
 `_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
 
-Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training objective, called Gap Sentence Generation (GSG).
+Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
+two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
+objective, called Gap Sentence Generation (GSG).
 
- * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in BERT)
- * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a causal mask to hide the future words like a regular auto-regressive transformer decoder.
+ * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like
+ in BERT)
+ * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a
+ causal mask to hide the future words like a regular auto-regressive transformer decoder.
 
-In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
+In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
+masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
 
 The library provides a version of this model for conditional generation, which should be used for summarization.
 
@@ -577,11 +582,12 @@ The pretraining includes both supervised and self-supervised training. Supervise
 tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
 
 Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and
-replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the
-original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
+replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group
+is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder
+is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
 
-For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and "cute", the encoder
-input becomes “My very .” and the target input becomes “ dog is cute .”
+For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and
+"cute", the encoder input becomes “My very .” and the target input becomes “ dog is cute .”
 
 The library provides a version of this model for conditional generation.
 
@@ -597,7 +603,8 @@ MBart
 Doc
 
-`Multilingual Denoising Pre-training for Neural Machine Translation `_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
+`Multilingual Denoising Pre-training for Neural Machine Translation `_ by Yinhan
+Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
 Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 
 The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages
@@ -606,11 +613,12 @@ for pre-training a complete sequence-to-sequence model by denoising full texts i
 
 The library provides a version of this model for conditional generation.
 
-The `mbart-large-en-ro checkpoint `_ can be used for english -> romanian translation.
+The `mbart-large-en-ro checkpoint `_ can be used for english ->
+romanian translation.
 
-The `mbart-large-cc25 `_ checkpoint can be finetuned for other translation and summarization tasks, using code in ```examples/seq2seq/``` , but is not very useful without finetuning.
+The `mbart-large-cc25 `_ checkpoint can be finetuned for other
+translation and summarization tasks, using code in ```examples/seq2seq/``` , but is not very useful without finetuning.
 
-.. _multimodal-models:
 
 ProphetNet
 -----------------------------------------------------------------------------------------------------------------------
@@ -624,12 +632,18 @@ ProphetNet
 Doc
 
-`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, `__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
+`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, `__ by
+Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
 
-ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations.
-The model architecture is based on the original Transformer, but replaces the "standard" self-attention mechanism in the decoder by a a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
+ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
+future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at
+each time step instead instead of just the single next token. The future n-gram prediction explicitly encourages
+the model to plan for the future tokens and prevent overfitting on strong local correlations.
+The model architecture is based on the original Transformer, but replaces the "standard" self-attention mechanism
+in the decoder by a a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
 
-The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for summarization.
+The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
+summarization.
 
 XLM-ProphetNet
 -----------------------------------------------------------------------------------------------------------------------
@@ -643,11 +657,16 @@ XLM-ProphetNet
 Doc
 
-`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, `__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
+`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, `__ by
+Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
 
-XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was pre-trained on the cross-lingual dataset `XGLUE `__.
+XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was
+pre-trained on the cross-lingual dataset `XGLUE `__.
 
-The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned versions for headline generation and question generation, respectively.
+The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
+versions for headline generation and question generation, respectively.
+
+.. _multimodal-models:
 
 Multimodal models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^