From f471979167b399ea0e3af7fa39bc960e954f85f5 Mon Sep 17 00:00:00 2001
From: Ananya Harsh Jha <ahj265@nyu.edu>
Date: Thu, 21 Mar 2019 15:38:30 -0400
Subject: [PATCH] added GLUE dev set results and details on how to run GLUE
 tasks

---
 README.md | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index afc2b6efda..98fd27badb 100644
--- a/README.md
+++ b/README.md
@@ -927,11 +927,60 @@ Where `$THIS_MACHINE_INDEX` is an sequential index assigned to each of your mach
 
 We showcase several fine-tuning examples based on (and extended from) [the original implementation](https://github.com/google-research/bert/):
 
-- a *sequence-level classifier* on the MRPC classification corpus,
+- a *sequence-level classifier* on nine different GLUE tasks,
 - a *token-level classifier* on the question answering dataset SQuAD, and
 - a *sequence-level multiple-choice classifier* on the SWAG classification corpus.
 - a *BERT language model* on another target corpus
 
+#### GLUE results on dev set
+
+We get the following results on the dev set of the GLUE benchmark with an uncased BERT base
+model. All experiments were run on a P100 GPU with a batch size of 32.
+
+| Task | Metric | Result |
+|-|-|-|
+| CoLA | Matthews corr. | 57.29 |
+| SST-2 | accuracy | 93.00 |
+| MRPC | F1/accuracy | 88.85/83.82 |
+| STS-B | Pearson/Spearman corr. | 89.70/89.37 |
+| QQP | accuracy/F1 | 90.72/87.41 |
+| MNLI | matched acc./mismatched acc. | 83.95/84.39 |
+| QNLI | accuracy | 89.04 |
+| RTE | accuracy | 61.01 |
+| WNLI | accuracy | 53.52 |
+
+Some of these results are significantly different from the ones reported on the test set
+of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
+
+Before running any of these GLUE tasks, you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+
+```shell
+export GLUE_DIR=/path/to/glue
+export TASK_NAME=MRPC
+
+python run_classifier.py \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --data_dir $GLUE_DIR/$TASK_NAME \
+  --bert_model bert-base-uncased \
+  --max_seq_length 128 \
+  --train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/$TASK_NAME/
+```
+
+where `$TASK_NAME` can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
+
+The dev set results are written to the text file `eval_results.txt` in the specified `output_dir`. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
+
+The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA, and SST-2. The following section provides details on how to run half-precision training with MRPC. That said, there shouldn't be any issues running half-precision training with the remaining GLUE tasks, since the data processor for each task inherits from the base class `DataProcessor`.
+
 #### MRPC
 
 This example code fine-tunes BERT on the Microsoft Research Paraphrase