Fix bug in perplexity guide calculations and update perplexity numbers. Fixes #22348 (#22411)

Fix bug in perplexity guide calculations and update perplexity numbers.
fpgaminer
2023-03-28 06:09:17 -07:00
committed by GitHub
parent 32ff06403d
commit ed57c979b9

@@ -115,11 +115,10 @@ for begin_loc in tqdm(range(0, seq_len, stride)):
     with torch.no_grad():
         outputs = model(input_ids, labels=target_ids)
-        # loss is calculated using CrossEntropyLoss which averages over input tokens.
-        # Multiply it with trg_len to get the summation instead of average.
-        # We will take average over all the tokens to get the true average
-        # in the last step of this example.
-        neg_log_likelihood = outputs.loss * trg_len
+        # loss is calculated using CrossEntropyLoss which averages over valid labels
+        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
+        # to the left by 1.
+        neg_log_likelihood = outputs.loss
     nlls.append(neg_log_likelihood)
@@ -127,14 +126,14 @@ for begin_loc in tqdm(range(0, seq_len, stride)):
     if end_loc == seq_len:
         break
-ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
+ppl = torch.exp(torch.stack(nlls).mean())
 ```
 Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
 strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
 and the better the reported perplexity will typically be.
-When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.64`, which is about the same
+When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.44`, which is about the same
 as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window
-strategy, this jumps down to `16.44`. This is not only a more favorable score, but is calculated in a way that is
+strategy, this jumps down to `16.45`. This is not only a more favorable score, but is calculated in a way that is
 closer to the true autoregressive decomposition of a sequence likelihood.
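
The two aggregation formulas touched by this diff can be compared on toy numbers. The following is a minimal sketch in plain Python rather than PyTorch. The helper names (`window_mean_loss`, `ppl_scaled_sum`, `ppl_mean_of_means`) and the per-label loss values are hypothetical and do not come from the guide. It assumes non-overlapping windows in which each window scores `trg_len - 1` labels, as the new code comment describes.

```python
import math


def window_mean_loss(label_nlls):
    """Mean negative log-likelihood over one window's valid labels.

    Stands in for `outputs.loss`, which the diff describes as an average
    over valid labels.
    """
    return sum(label_nlls) / len(label_nlls)


def ppl_scaled_sum(windows, trg_lens, end_loc):
    """Pre-change formula: scale each mean loss by trg_len, sum, divide by end_loc."""
    total = sum(window_mean_loss(w) * t for w, t in zip(windows, trg_lens))
    return math.exp(total / end_loc)


def ppl_mean_of_means(windows):
    """Post-change formula: exponentiate the plain mean of the per-window losses."""
    losses = [window_mean_loss(w) for w in windows]
    return math.exp(sum(losses) / len(losses))


# Hypothetical per-label NLLs: two full windows of 4 tokens (3 scored labels each).
full_windows = [[2.0, 3.0, 4.0], [1.0, 2.0, 3.0]]
same_old = ppl_scaled_sum(full_windows, trg_lens=[4, 4], end_loc=8)
same_new = ppl_mean_of_means(full_windows)

# Hypothetical case with a shorter final window of 2 tokens (1 scored label).
ragged_windows = [[2.0, 3.0, 4.0], [1.0]]
ragged_old = ppl_scaled_sum(ragged_windows, trg_lens=[4, 2], end_loc=6)
ragged_new = ppl_mean_of_means(ragged_windows)

print(same_old, same_new)
print(ragged_old, ragged_new)
```

On these toy inputs the two formulas agree when every window has the same length and diverge once the final window is shorter, because the old formula weights each window by `trg_len` while the new one weights every window equally. This only illustrates the arithmetic and is not a reproduction of the reported PPL values.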