diff --git a/docs/source/examples.rst b/docs/source/examples.rst
index 51c8d850b9..7777117b47 100644
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
@@ -68,7 +68,9 @@ GLUE results on dev set
 ~~~~~~~~~~~~~~~~~~~~~~~
 
 We get the following results on the dev set of GLUE benchmark with an uncased BERT base
-model. All experiments were run on a P100 GPU with a batch size of 32.
+model (`bert-base-uncased`). All experiments ran on 8 V100 GPUs with a total train batch size of 24. Some of
+these tasks have a small dataset and training can lead to high variance in the results between different runs.
+We report the median on 5 runs (with different seeds) for each of the metrics.
 
 .. list-table::
    :header-rows: 1
@@ -78,31 +80,31 @@ model. All experiments were run on a P100 GPU with a batch size of 32.
      - Result
    * - CoLA
      - Matthew's corr.
-     - 57.29
+     - 55.75
    * - SST-2
      - accuracy
-     - 93.00
+     - 92.09
    * - MRPC
      - F1/accuracy
-     - 88.85/83.82
+     - 90.48/86.27
    * - STS-B
      - Pearson/Spearman corr.
-     - 89.70/89.37
+     - 89.03/88.64
    * - QQP
      - accuracy/F1
-     - 90.72/87.41
+     - 90.92/87.72
    * - MNLI
      - matched acc./mismatched acc.
-     - 83.95/84.39
+     - 83.74/84.06
    * - QNLI
      - accuracy
-     - 89.04
+     - 91.07
    * - RTE
      - accuracy
-     - 61.01
+     - 68.59
    * - WNLI
      - accuracy
-     - 53.52
+     - 43.66
 
 Some of these results are significantly different from the ones reported on the test set