Doc styling (#8067)

* Important files
* Styling them all
* Revert "Styling them all"
  This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.
* Syling them for realsies
* Fix syntax error
* Fix benchmark_utils
* More fixes
* Fix modeling auto and script
* Remove new line
* Fixes
* More fixes
* Fix more files
* Style
* Add FSMT
* More fixes
* More fixes
* More fixes
* More fixes
* Fixes
* More fixes
* More fixes
* Last fixes
* Make sphinx happy
2020-10-26 18:26:02 -04:00
parent 04a17f8550
commit 08f534d2da
271 changed files with 9726 additions and 8991 deletions
--- a/docs/source/perplexity.rst
+++ b/docs/source/perplexity.rst
@@ -1,86 +1,69 @@
 Perplexity of fixed-length models
 =======================================================================================================================

-Perplexity (PPL) is one of the most common metrics for evaluating language
-models. Before diving in, we should note that the metric applies specifically
-to classical language models (sometimes called autoregressive or causal
-language models) and is not well defined for masked language models like BERT
-(see :doc:`summary of the models <model_summary>`).
+Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
+that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
+models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
+<model_summary>`).

-Perplexity is defined as the exponentiated average negative log-likelihood of a
-sequence. If we have a tokenized sequence :math:`X = (x_0, x_1, \dots, x_t)`,
-then the perplexity of :math:`X` is,
+Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence
+:math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,

 .. math::

    \text{PPL}(X)
    = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}

-where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith
-token conditioned on the preceding tokens :math:`x_{<i}` according to our
-model. Intuitively, it can be thought of as an evaluation of the model's
-ability to predict uniformly among the set of specified tokens in a corpus.
-Importantly, this means that the tokenization procedure has a direct impact
-on a model's perplexity, which should always be taken into consideration when
-comparing different models.
+where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
+:math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
+predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
+procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing
+different models.
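
As a rough illustration of the definition above, the following sketch computes perplexity from a list of per-token
log-likelihoods. The function name ``perplexity`` and the toy inputs are assumptions made for this example; they are
not part of 🤗 Transformers.

```python
import math


def perplexity(token_log_likelihoods):
    """Exponentiated average negative log-likelihood of a sequence."""
    n = len(token_log_likelihoods)
    if n == 0:
        raise ValueError("need at least one token log-likelihood")
    return math.exp(-sum(token_log_likelihoods) / n)


# Hypothetical model that spreads probability uniformly over 50 tokens:
# every token gets log(1/50), so the perplexity equals the vocabulary size.
uniform_ppl = perplexity([math.log(1 / 50)] * 10)
```

This matches the intuition given above: a model that can do no better than guess uniformly among 50 tokens has a
perplexity of about 50, and a model that assigns probability 1 to every token has a perplexity of 1.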

-This is also equivalent to the exponentiation of the cross-entropy between
-the data and model predictions. For more intuition about perplexity and its
-relationship to Bits Per Character (BPC) and data compression, check out this
-`fantastic blog post on The Gradient
-<https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
+This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
+intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
+`fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.

 Calculating PPL with fixed-length models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-If we weren't limited by a model's context size, we would evaluate the
-model's perplexity by autoregressively factorizing a sequence and
-conditioning on the entire preceding subsequence at each step, as shown
-below.
+If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
+factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

 .. image:: imgs/ppl_full.gif
    :width: 600
    :alt: Full decomposition of a sequence with unlimited context length

-When working with approximate models, however, we typically have a constraint
-on the number of tokens the model can process. The largest version
-of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024
-tokens, so we cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when
-:math:`t` is greater than 1024.
+When working with approximate models, however, we typically have a constraint on the number of tokens the model can
+process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
+cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.

-Instead, the sequence is typically broken into subsequences equal to the
-model's maximum input size. If a model's max input size is :math:`k`, we
-then approximate the likelihood of a token :math:`x_t` by conditioning only
-on the :math:`k-1` tokens that precede it rather than the entire context.
-When evaluating the model's perplexity of a sequence, a tempting but
-suboptimal approach is to break the sequence into disjoint chunks and
-add up the decomposed log-likelihoods of each segment independently.
+Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
+input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
+:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
+sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
+log-likelihoods of each segment independently.

 .. image:: imgs/ppl_chunked.gif
    :width: 600
    :alt: Suboptimal PPL not taking advantage of full available context

-This is quick to compute since the perplexity of each segment can be computed
-in one forward pass, but serves as a poor approximation of the
-fully-factorized perplexity and will typically yield a higher (worse) PPL
-because the model will have less context at most of the prediction steps.
+This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
+approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
+have less context at most of the prediction steps.
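
To make the loss of context concrete, here is a small sketch that counts how many preceding tokens are visible at each
position when a sequence is split into disjoint chunks. The helper ``disjoint_chunk_context`` and the toy sizes are
hypothetical and only meant to illustrate the point.

```python
def disjoint_chunk_context(seq_len, max_len):
    """Number of context tokens available at each position when the
    sequence is cut into non-overlapping chunks of max_len tokens."""
    return [pos % max_len for pos in range(seq_len)]


context = disjoint_chunk_context(seq_len=8, max_len=4)
# The first token of every chunk is predicted with no context at all.
average_context = sum(context) / len(context)
```

With chunks of 4 tokens, the available context cycles through 0, 1, 2, 3 and then resets, so on average each token
sees far less than the maximum context the model could use.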

-Instead, the PPL of fixed-length models should be evaluated with a
-sliding-window strategy. This involves repeatedly sliding the
-context window so that the model has more context when making each
-prediction.
+Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
+sliding the context window so that the model has more context when making each prediction.

 .. image:: imgs/ppl_sliding.gif
    :width: 600
    :alt: Sliding window PPL taking advantage of all available context

-This is a closer approximation to the true decomposition of the
-sequence probability and will typically yield a more favorable score.
-The downside is that it requires a separate forward pass for each token in
-the corpus. A good practical compromise is to employ a strided sliding
-window, moving the context by larger strides rather than sliding by 1 token at a
-time. This allows computation to proceed much faster while still giving the
-model a large context to make predictions at each step.
+This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
+favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
+practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
+1 token at a time. This allows computation to proceed much faster while still giving the model a large context to
+make predictions at each step.
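
The trade-off between stride and number of forward passes can be sketched with plain index arithmetic. The helper
``strided_windows`` below is a hypothetical illustration, not the exact loop used later in this document, and it
assumes ``stride <= max_len`` so that every token is scored exactly once.

```python
def strided_windows(seq_len, max_len, stride):
    """Return one (begin, end, n_scored) tuple per forward pass.

    Tokens in [begin, end) are fed to the model. Only the last n_scored
    of them are scored; the earlier ones serve purely as context.
    Assumes stride <= max_len.
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows


windows = strided_windows(seq_len=10, max_len=4, stride=2)
```

In this toy setting, a stride of 1 needs 7 forward passes, a stride of 2 needs 4, and a stride equal to ``max_len``
needs only 3 but gives the first token of each window no context.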

 Example: Calculating perplexity with GPT-2 in 🤗 Transformers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -95,10 +78,9 @@ Let's demonstrate this process with GPT-2.
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

-We'll load in the WikiText-2 dataset and evaluate the perplexity using a few
-different sliding-window strategies. Since this dataset is small and we're
-just doing one forward pass over the set, we can just load and encode the
-entire dataset in memory.
+We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
+this dataset is small and we're just doing one forward pass over the set, we can just load and encode the entire
+dataset in memory.

 .. code-block:: python

@@ -106,16 +88,13 @@ entire dataset in memory.
    test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')

-With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels``
-to our model, and the average negative log-likelihood for each token is returned as
-the loss. With our sliding window approach, however, there is overlap in the
-tokens we pass to the model at each iteration. We don't want the
-log-likelihood for the tokens we're just treating as context to be included
-in our loss, so we can set these targets to ``-100`` so that they are
-ignored. The following is an example of how we could do this with a stride of
-``512``. This means that the model will have at least 512 tokens for context
-when calculating the conditional likelihood of any one token (provided there
-are 512 preceding tokens available to condition on).
+With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average negative
+log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
+the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
+as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
+is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
+for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
+available to condition on).
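
The masking step described above can be sketched without any framework. In the sketch below, plain Python lists stand
in for tensors, and the helper ``mask_context_labels`` is a hypothetical name introduced for illustration; only the
``-100`` label value comes from the text above.

```python
IGNORE_INDEX = -100  # label value that is ignored when computing the loss


def mask_context_labels(input_ids, n_scored):
    """Copy input_ids as labels, replacing every token except the last
    n_scored with IGNORE_INDEX so that context tokens do not contribute."""
    if not 0 < n_scored <= len(input_ids):
        raise ValueError("n_scored must be between 1 and len(input_ids)")
    n_context = len(input_ids) - n_scored
    return [IGNORE_INDEX] * n_context + list(input_ids[n_context:])


# Toy window of six token ids where only the last two are new targets.
labels = mask_context_labels([11, 12, 13, 14, 15, 16], n_scored=2)
```

Here the first four tokens still condition the model's predictions, but only the last two positions count towards the
loss for this window.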

 .. code-block:: python

@@ -139,14 +118,11 @@ are 512 preceding tokens available to condition on).

    ppl = torch.exp(torch.stack(lls).sum() / end_loc)

-Running this with the stride length equal to the max input length is
-equivalent to the suboptimal, non-sliding-window strategy we discussed above.
-The smaller the stride, the more context the model will have in making each
-prediction, and the better the reported perplexity will typically be.
+Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
+strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
+and the better the reported perplexity will typically be.

-When we run the above with ``stride = 1024``, i.e. no overlap, the resulting
-PPL is ``19.64``, which is about the same as the ``19.93`` reported in the
-GPT-2 paper. By using ``stride = 512`` and thereby employing our strided
-sliding-window strategy, this drops to ``16.53``. This is not only a more
-favorable score, but is calculated in a way that is closer to the true
-autoregressive decomposition of a sequence likelihood.
+When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
+as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our strided
+sliding-window strategy, this drops to ``16.53``. This is not only a more favorable score, but is calculated in a way that is
+closer to the true autoregressive decomposition of a sequence likelihood.