Update dev results on GLUE (bert-base-uncased) w/ median on 5 runs

2019-08-21 03:43:29 +00:00
parent 07681b6b58
commit 6f877d9daf
1 changed file with 12 additions and 10 deletions
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
@@ -68,7 +68,9 @@ GLUE results on dev set
 ~~~~~~~~~~~~~~~~~~~~~~~

 We get the following results on the dev set of GLUE benchmark with an uncased BERT base
-model. All experiments were run on a P100 GPU with a batch size of 32.
+model (``bert-base-uncased``). All experiments ran on 8 V100 GPUs with a total train batch size of 24. Some of
+these tasks have small datasets, so results can vary widely between runs.
+We report the median over 5 runs (with different seeds) for each metric.

 .. list-table::
   :header-rows: 1
@@ -78,31 +80,31 @@ model. All experiments were run on a P100 GPU with a batch size of 32.
     - Result
   * - CoLA
     - Matthew's corr.
-     - 57.29
+     - 55.75
   * - SST-2
     - accuracy
-     - 93.00
+     - 92.09
   * - MRPC
     - F1/accuracy
-     - 88.85/83.82
+     - 90.48/86.27
   * - STS-B
     - Pearson/Spearman corr.
-     - 89.70/89.37
+     - 89.03/88.64
   * - QQP
     - accuracy/F1
-     - 90.72/87.41
+     - 90.92/87.72
   * - MNLI
     - matched acc./mismatched acc.
-     - 83.95/84.39
+     - 83.74/84.06
   * - QNLI
     - accuracy
-     - 89.04
+     - 91.07
   * - RTE
     - accuracy
-     - 61.01
+     - 68.59
   * - WNLI
     - accuracy
-     - 53.52
+     - 43.66


 Some of these results are significantly different from the ones reported on the test set