From f471979167b399ea0e3af7fa39bc960e954f85f5 Mon Sep 17 00:00:00 2001
From: Ananya Harsh Jha
Date: Thu, 21 Mar 2019 15:38:30 -0400
Subject: [PATCH] added GLUE dev set results and details on how to run GLUE tasks

---
 README.md | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index afc2b6efda..98fd27badb 100644
--- a/README.md
+++ b/README.md
@@ -927,11 +927,60 @@ Where `$THIS_MACHINE_INDEX` is an sequential index assigned to each of your mach
 
 We showcase several fine-tuning examples based on (and extended from) [the original implementation](https://github.com/google-research/bert/):
 
-- a *sequence-level classifier* on the MRPC classification corpus,
+- a *sequence-level classifier* on nine different GLUE tasks,
 - a *token-level classifier* on the question answering dataset SQuAD, and
 - a *sequence-level multiple-choice classifier* on the SWAG classification corpus.
 - a *BERT language model* on another target corpus
 
+#### GLUE results on dev set
+
+We get the following results on the dev set of the GLUE benchmark with an uncased BERT base
+model. All experiments were run on a P100 GPU with a batch size of 32.
+
+| Task | Metric | Result |
+|-|-|-|
+| CoLA | Matthews corr. | 57.29 |
+| SST-2 | accuracy | 93.00 |
+| MRPC | F1/accuracy | 88.85/83.82 |
+| STS-B | Pearson/Spearman corr. | 89.70/89.37 |
+| QQP | accuracy/F1 | 90.72/87.41 |
+| MNLI | matched acc./mismatched acc.| 83.95/84.39 |
+| QNLI | accuracy | 89.04 |
+| RTE | accuracy | 61.01 |
+| WNLI | accuracy | 53.52 |
+
+Some of these results are significantly different from the ones reported on the test set
+of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
+
+Before running any one of these GLUE tasks you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+
+```shell
+export GLUE_DIR=/path/to/glue
+export TASK_NAME=MRPC
+
+python run_classifier.py \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --data_dir $GLUE_DIR/$TASK_NAME \
+  --bert_model bert-base-uncased \
+  --max_seq_length 128 \
+  --train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/$TASK_NAME/
+```
+
+where `$TASK_NAME` can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
+
+The dev set results will be written to the text file `eval_results.txt` in the specified `output_dir`. For MNLI, which has two separate dev sets (matched and mismatched), there will be a separate output folder `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
+
+The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA, and SST-2. The following section provides details on how to run half-precision training with MRPC. That said, there shouldn't be any issues running half-precision training on the remaining GLUE tasks, since the data processor for each task inherits from the base class `DataProcessor`.
+
 #### MRPC
 
 This example code fine-tunes BERT on the Microsoft Research Paraphrase