Doc styling (#8067)
* Important files
* Styling them all
* Revert "Styling them all"
  This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.
* Syling them for realsies
* Fix syntax error
* Fix benchmark_utils
* More fixes
* Fix modeling auto and script
* Remove new line
* Fixes
* More fixes
* Fix more files
* Style
* Add FSMT
* More fixes
* More fixes
* More fixes
* More fixes
* Fixes
* More fixes
* More fixes
* Last fixes
* Make sphinx happy
@@ -1,86 +1,69 @@
Perplexity of fixed-length models
=======================================================================================================================

Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see :doc:`summary of the models
<model_summary>`).

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence :math:`X = (x_0, x_1, \dots, x_t)`, then the perplexity of :math:`X` is,

.. math::

    \text{PPL}(X)
    = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}

where :math:`\log p_\theta (x_i|x_{<i})` is the log-likelihood of the ith token conditioned on the preceding tokens
:math:`x_{<i}` according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to
predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization
procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing
different models.

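As a small illustration of the definition above, the following sketch computes perplexity from a list of per-token
log-probabilities. The function name ``perplexity`` and the probability values are made up for this example and do not
come from any library or model.

.. code-block:: python

    import math

    def perplexity(token_log_probs):
        """Exponentiated average negative log-likelihood (natural log)."""
        if not token_log_probs:
            raise ValueError("need at least one token log-probability")
        return math.exp(-sum(token_log_probs) / len(token_log_probs))

    # Hypothetical probabilities a model assigned to each observed token.
    probs = [0.5, 0.25, 0.125]
    ppl = perplexity([math.log(p) for p in probs])
    print(round(ppl, 6))  # 4.0, the inverse of the geometric mean of the probabilities

A model that spreads its probability uniformly over ``N`` choices at every step would score a perplexity of ``N`` under
this definition.
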
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
`fantastic blog post on The Gradient <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.

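The equivalence with cross-entropy can be checked numerically. The sketch below uses made-up probabilities and a
hypothetical helper ``cross_entropy_nats``; it reports bits per token, which only coincides with bits per character
when each token is a single character.

.. code-block:: python

    import math

    def cross_entropy_nats(probs):
        """Average negative natural-log probability of the observed tokens."""
        return -sum(math.log(p) for p in probs) / len(probs)

    # Hypothetical probabilities a model assigned to each observed token.
    probs = [0.5, 0.25, 0.125]
    ce_nats = cross_entropy_nats(probs)
    ce_bits = ce_nats / math.log(2)  # same quantity expressed in bits per token

    ppl_from_nats = math.exp(ce_nats)
    ppl_from_bits = 2 ** ce_bits
    print(round(ce_bits, 6), round(ppl_from_nats, 6))  # 2.0 4.0
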
Calculating PPL with fixed-length models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

.. image:: imgs/ppl_full.gif
    :width: 600
    :alt: Full decomposition of a sequence with unlimited context length

When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of :doc:`GPT-2 <model_doc/gpt2>`, for example, has a fixed length of 1024 tokens, so we
cannot calculate :math:`p_\theta(x_t|x_{<t})` directly when :math:`t` is greater than 1024.

Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is :math:`k`, we then approximate the likelihood of a token :math:`x_t` by conditioning only on the
:math:`k-1` tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.

.. image:: imgs/ppl_chunked.gif
    :width: 600
    :alt: Suboptimal PPL not taking advantage of full available context

This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.

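To make the loss of context concrete, the sketch below counts how many preceding tokens each position can see under
disjoint chunking. The helper ``disjoint_context_lengths`` and the sequence length of 2048 are assumptions chosen for
illustration; only the chunk size of 1024 comes from the GPT-2 example above.

.. code-block:: python

    def disjoint_context_lengths(seq_len, max_length):
        """Number of preceding tokens visible at each position with disjoint chunks."""
        # Context resets to zero at the start of every chunk.
        return [pos % max_length for pos in range(seq_len)]

    contexts = disjoint_context_lengths(seq_len=2048, max_length=1024)
    mean_context = sum(contexts) / len(contexts)
    print(contexts[1023], contexts[1024])  # 1023 0
    print(mean_context)  # 511.5

On average each prediction sees only about half of the window, and the first tokens of every chunk see almost nothing.
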
Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.

.. image:: imgs/ppl_sliding.gif
    :width: 600
    :alt: Sliding window PPL taking advantage of all available context

This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.

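One way to lay out the strided windows is sketched below. The generator ``strided_windows`` and its tuple layout are
assumptions made for this illustration, not part of 🤗 Transformers: each window covers ``[begin, end)``, and only its
last ``n_targets`` tokens are scored, so every token is scored exactly once.

.. code-block:: python

    def strided_windows(seq_len, max_length, stride):
        """Yield (begin, end, n_targets) for each strided window over a sequence."""
        if not 0 < stride <= max_length:
            raise ValueError("stride must be between 1 and max_length")
        for start in range(0, seq_len, stride):
            begin = max(start + stride - max_length, 0)  # reach back for extra context
            end = min(start + stride, seq_len)           # last window may be shorter
            yield begin, end, end - start

    windows = list(strided_windows(seq_len=10, max_length=4, stride=2))
    print(windows)  # [(0, 2, 2), (0, 4, 2), (2, 6, 2), (4, 8, 2), (6, 10, 2)]

With ``stride=1`` this produces one window per token, and with ``stride=max_length`` the windows no longer overlap.
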
Example: Calculating perplexity with GPT-2 in 🤗 Transformers
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
@@ -95,10 +78,9 @@ Let's demonstrate this process with GPT-2.
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

We'll load in the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're only doing one forward pass over the set, we can load and encode the entire dataset in
memory.

.. code-block:: python

@@ -106,16 +88,13 @@ entire dataset in memory.
    test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')

With 🤗 Transformers, we can simply pass the ``input_ids`` as the ``labels`` to our model, and the average negative
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to ``-100`` so that they are ignored. The following
is an example of how we could do this with a stride of ``512``. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).

.. code-block:: python

@@ -139,14 +118,11 @@ are 512 preceding tokens available to condition on).

    ppl = torch.exp(torch.stack(lls).sum() / end_loc)

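The masking and accumulation described above can also be sketched without loading a model. Everything in this block is
a hypothetical stand-in written for illustration: ``mask_context``, ``strided_perplexity`` and ``uniform_mean_nll`` are
not library functions, and ``uniform_mean_nll`` simply pretends that the model assigns probability ``1 / 50`` to every
token. Unlike a real causal language model, this toy version also scores the very first token.

.. code-block:: python

    import math

    IGNORE_INDEX = -100  # label value that is skipped when computing the loss

    def mask_context(window_ids, n_targets):
        """Copy the window as labels and mask everything except the last n_targets."""
        labels = list(window_ids)
        n_context = len(labels) - n_targets
        labels[:n_context] = [IGNORE_INDEX] * n_context
        return labels

    def uniform_mean_nll(window_ids, labels, vocab_size=50):
        """Stand-in for a model call: mean negative log-likelihood over unmasked labels."""
        return math.log(vocab_size)

    def strided_perplexity(token_ids, max_length, stride, mean_nll_fn):
        """Accumulate the summed negative log-likelihood over strided windows."""
        total_nll, n_scored = 0.0, 0
        for start in range(0, len(token_ids), stride):
            begin = max(start + stride - max_length, 0)
            end = min(start + stride, len(token_ids))
            n_targets = end - start  # may be smaller than stride on the last window
            window = token_ids[begin:end]
            labels = mask_context(window, n_targets)
            # The loss is a mean, so scale it back to a sum before accumulating.
            total_nll += mean_nll_fn(window, labels) * n_targets
            n_scored += n_targets
        return math.exp(total_nll / n_scored)

    print(mask_context([5, 6, 7, 8], n_targets=2))  # [-100, -100, 7, 8]
    ppl = strided_perplexity(list(range(10)), max_length=4, stride=2, mean_nll_fn=uniform_mean_nll)
    print(round(ppl, 6))  # 50.0 for the uniform stand-in, regardless of stride
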
Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.

When we run the above with ``stride = 1024``, i.e. no overlap, the resulting PPL is ``19.64``, which is about the same
as the ``19.93`` reported in the GPT-2 paper. By using ``stride = 512`` and thereby employing our strided
sliding-window strategy, this drops to ``16.53``. This is not only a more favorable score, but is calculated in a way
that is closer to the true autoregressive decomposition of a sequence likelihood.
