Doc styling (#8067)

* Important files * Styling them all * Revert "Styling them all" This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e. * Syling them for realsies * Fix syntax error * Fix benchmark_utils * More fixes * Fix modeling auto and script * Remove new line * Fixes * More fixes * Fix more files * Style * Add FSMT * More fixes * More fixes * More fixes * More fixes * Fixes * More fixes * More fixes * Last fixes * Make sphinx happy
2020-10-26 18:26:02 -04:00
parent 04a17f8550
commit 08f534d2da
271 changed files with 9726 additions and 8991 deletions
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -1,12 +1,12 @@
 Summary of the models
 =======================================================================================================================

-This is a summary of the models available in 🤗 Transformers. It assumes you’re familiar with the original
-`transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
+This is a summary of the models available in 🤗 Transformers. It assumes you’re familiar with the original `transformer
+model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
 <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
-models. You can check them more in detail in their respective documentation. Also checkout the
-:doc:`pretrained model page </pretrained_models>` to see the checkpoints available for each type of model and all `the
-community models <https://huggingface.co/models>`_.
+models. You can check them more in detail in their respective documentation. Also checkout the :doc:`pretrained model
+page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
+<https://huggingface.co/models>`_.

 Each one of the models in the library falls into one of the following categories:

@@ -19,8 +19,8 @@ Each one of the models in the library falls into one of the following categories
 Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
 previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
 sentence so that the attention heads can only see what was before in the next, and not what’s after. Although those
-models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation.
-A typical example of such models is GPT.
+models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
+typical example of such models is GPT.

 Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
 sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
@@ -30,8 +30,8 @@ sentence classification or token classification. A typical example of such model

 Note that the only difference between autoregressive models and autoencoding models is in the way the model is
 pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
-model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first
-introduced.
+model has been used for both types of pretraining, we have put it in the category corresponding to the article where it
+was first introduced.

 Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
 tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
@@ -60,8 +60,8 @@ Original GPT
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
   </a>

-`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_,
-Alec Radford et al.
+`Improving Language Understanding by Generative Pre-Training
+<https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_, Alec Radford et al.

 The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.

@@ -80,7 +80,8 @@ GPT-2
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
   </a>

-`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
+`Language Models are Unsupervised Multitask Learners
+<https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
 Alec Radford et al.

 A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
@@ -122,8 +123,8 @@ Transformer-XL
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
   </a>

-`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_,
-Zihang Dai et al.
+`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_, Zihang
+Dai et al.

 Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
 RNNs with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
@@ -153,8 +154,7 @@ Reformer
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
   </a>

-`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_,
-Nikita Kitaev et al .
+`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_, Nikita Kitaev et al .

 An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
 include:
@@ -188,8 +188,8 @@ XLNet
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
   </a>

-`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_,
-Zhilin Yang et al.
+`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_, Zhilin
+Yang et al.

 XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
 tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
@@ -207,7 +207,8 @@ Autoencoding models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
-look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions.
+look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
+corrupted versions.

 BERT
 -----------------------------------------------------------------------------------------------------------------------
@@ -260,8 +261,8 @@ Same as BERT but with a few tweaks:
    sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
    being the vocab size). If E < H, it has less parameters.
  * Layers are split in groups that share parameters (to save memory).
-  * Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B
-    (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
+  * Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and
+    B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
    been swapped or not.

 The library provides a version of the model for masked language modeling, token classification, sentence
@@ -279,8 +280,7 @@ RoBERTa
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
   </a>

-`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_,
-Yinhan Liu et al.
+`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_, Yinhan Liu et al.

 Same as BERT with better pretraining tricks:

@@ -339,8 +339,8 @@ library provides checkpoints for all of them:
    previous section as well). One of the languages is selected for each training sample, and the model input is a
    sentence of 256 tokens, that may span over several documents in one of those languages.
  * Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
-    and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with
-    dynamic masking of the tokens.
+    and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages,
+    with dynamic masking of the tokens.
  * A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
    different languages, with random masking. To predict one of the masked tokens, the model can use both, the
    surrounding context in language 1 and the context given by language 2.
@@ -523,20 +523,21 @@ Pegasus
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
   </a>

-`PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization 
+`PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization
 <https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.

 Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
 two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
 objective, called Gap Sentence Generation (GSG).

-  * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like
-    in BERT)
+  * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in
+    BERT)
  * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a
    causal mask to hide the future words like a regular auto-regressive transformer decoder.

 In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are
-masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
+masked and are generated together as one output sequence from the remaining sentences, similar to an extractive
+summary.

 The library provides a version of this model for conditional generation, which should be used for summarization.

@@ -571,20 +572,20 @@ T5
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
   </a>

-`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_,
-Colin Raffel et al.
+`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
+<https://arxiv.org/abs/1910.10683>`_, Colin Raffel et al.

-Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at
-each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
+Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at each
+layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
 prefixes: “summarize: ”, “question: ”, “translate English to German: ” and so forth.

 The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
 tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).

-Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and
-replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group
-is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder
-is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
+Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with
+individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a
+single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original
+sentence and the target is then the dropped out tokens delimited by their sentinel tokens.

 For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and
 "cute", the encoder input becomes “My <x> very <y> .” and the target input becomes “<x> dog is <y> cute .<z>”
@@ -603,13 +604,12 @@ MBart
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
   </a>

-`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan
-Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
-Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
+`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
+Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

-The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages 
-and is intended for supervised and unsupervised machine translation. MBart is one of the first methods 
-for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages,
+The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages and is intended
+for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
+sequence-to-sequence model by denoising full texts in multiple languages,

 The library provides a version of this model for conditional generation.

@@ -636,11 +636,11 @@ ProphetNet
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.

 ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
-future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at
-each time step instead instead of just the single next token. The future n-gram prediction explicitly encourages
-the model to plan for the future tokens and prevent overfitting on strong local correlations.
-The model architecture is based on the original Transformer, but replaces the "standard" self-attention mechanism
-in the decoder by a a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism.
+future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
+time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
+to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
+the original Transformer, but replaces the "standard" self-attention mechanism in the decoder by a a main
+self-attention mechanism and a self and n-stream (predict) self-attention mechanism.

 The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for
 summarization.
@@ -682,8 +682,8 @@ et al.

 A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
 model takes as inputs the embeddings of the tokenized text and the final activations of a pretrained on images resnet
-(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the
-resnet to the hidden state dimension of the transformer).
+(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the resnet to
+the hidden state dimension of the transformer).

 The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
 model know which part of the input vector corresponds to the text and which to the image.
@@ -691,8 +691,7 @@ model know which part of the input vector corresponds to the text and which to t
 The pretrained model only works for classification.

 ..
-    More information in this :doc:`model documentation </model_doc/mmbt.html>`.
-    TODO: write this page
+    More information in this :doc:`model documentation </model_doc/mmbt.html>`. TODO: write this page

 .. _retrieval-based-models:

@@ -714,19 +713,22 @@ DPR
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
   </a>

-`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_,
-Vladimir Karpukhin et al.
+`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_, Vladimir Karpukhin et
+al.

-Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain question-answering research.
+Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain question-answering
+research.


 DPR consists in three models:

  * Question encoder: encode questions as vectors
  * Context encoder: encode contexts as vectors
-  * Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
+  * Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the
+    inferred span actually answers the question).

-DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and then it calls the reader with the question and the retrieved documents to get the answer.
+DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and
+then it calls the reader with the question and the retrieved documents to get the answer.

 RAG
 -----------------------------------------------------------------------------------------------------------------------
@@ -740,12 +742,14 @@ RAG
       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
   </a>

-`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_,
-Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
+`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_, Patrick Lewis,
+Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau
+Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela

-Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models.
-RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs.
-The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
+Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq
+models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and
+seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation
+to adapt to downstream tasks.

 The two models RAG-Token and RAG-Sequence are available for generation.

@@ -764,19 +768,19 @@ use a sparse version of the attention matrix to speed up training.
 **LSH attention**

 :ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
-dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can  consider only
+dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
 the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
-modified to mask the current token (except at the first position), because it will give a query and a key equal (so very
-similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by
-a n_rounds parameter) and then are averaged together.
+modified to mask the current token (except at the first position), because it will give a query and a key equal (so
+very similar to each other). Since the hash can be a bit random, several hash functions are used in practice
+(determined by a n_rounds parameter) and then are averaged together.

 .. _local-attention:

 **Local attention**

-:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the left and
-right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the
-last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
+:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the
+left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
+window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
 representation of the whole sentence.

 Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
@@ -799,8 +803,9 @@ Other tricks

 :ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
 E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
-hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and
-E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l`
-and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for
-time step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and
-:math:`j // l1` in E2.
+hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
+that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
+dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l` and
+:math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for time
+step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and :math:`j // l1`
+in E2.