Fix bug in perplexity guide calculations and update perplexity numbers.
@@ -115,11 +115,10 @@ for begin_loc in tqdm(range(0, seq_len, stride)):
     with torch.no_grad():
         outputs = model(input_ids, labels=target_ids)
 
-        # loss is calculated using CrossEntropyLoss which averages over input tokens.
-        # Multiply it with trg_len to get the summation instead of average.
-        # We will take average over all the tokens to get the true average
-        # in the last step of this example.
-        neg_log_likelihood = outputs.loss * trg_len
+        # loss is calculated using CrossEntropyLoss which averages over valid labels
+        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
+        # to the left by 1.
+        neg_log_likelihood = outputs.loss
 
     nlls.append(neg_log_likelihood)
 
@@ -127,14 +126,14 @@ for begin_loc in tqdm(range(0, seq_len, stride)):
     if end_loc == seq_len:
         break
 
-ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
+ppl = torch.exp(torch.stack(nlls).mean())
 ```
 
 Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
 strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
 and the better the reported perplexity will typically be.
 
-When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.64`, which is about the same
+When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.44`, which is about the same
 as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window
-strategy, this jumps down to `16.44`. This is not only a more favorable score, but is calculated in a way that is
+strategy, this jumps down to `16.45`. This is not only a more favorable score, but is calculated in a way that is
 closer to the true autoregressive decomposition of a sequence likelihood.
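
To make the change in this diff easier to follow, here is a minimal, torch-free sketch that compares the two aggregation formulas on made-up per-window losses. It is an illustration only, not the guide's code. The helper names (`window_target_lengths`, `ppl_sum_over_end_loc`, `ppl_mean_of_windows`) are hypothetical. The window layout (`end_loc = min(begin_loc + max_length, seq_len)` and `trg_len = end_loc - prev_end_loc`) is an assumption about the part of the loop that this diff does not show. The loss values are invented and are not measurements.

```python
import math


def window_target_lengths(seq_len, max_length, stride):
    """Return the number of newly scored tokens (trg_len) for each window.

    Assumed loop layout; the diff only shows the loop header and the
    `end_loc == seq_len` exit condition.
    """
    trg_lens = []
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_lens.append(end_loc - prev_end_loc)
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    return trg_lens


def ppl_sum_over_end_loc(window_losses, trg_lens, end_loc):
    """Removed lines: exp(sum(loss_i * trg_len_i) / end_loc)."""
    total = sum(loss * n for loss, n in zip(window_losses, trg_lens))
    return math.exp(total / end_loc)


def ppl_mean_of_windows(window_losses):
    """Added lines: exp(mean(loss_i)), every window weighted equally."""
    return math.exp(sum(window_losses) / len(window_losses))


# Hypothetical sequence and per-window mean losses (not measured values).
seq_len, max_length, stride = 2500, 1024, 512
trg_lens = window_target_lengths(seq_len, max_length, stride)
window_losses = [3.2, 2.9, 2.8, 2.7]

print("trg_len per window:", trg_lens)
print("old formula:", ppl_sum_over_end_loc(window_losses, trg_lens, seq_len))
print("new formula:", ppl_mean_of_windows(window_losses))
```

When every window scores the same number of tokens, the two formulas give the same value. They diverge when target lengths differ, for example in a longer first window or a shorter final one, because the new formula weights each window equally rather than by its token count.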