From dcb50eaa4b80d3ab75d373c36780c80fb47cfd97 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gr=C3=A9gory=20Ch=C3=A2tel?=
Date: Wed, 12 Dec 2018 18:17:46 +0100
Subject: [PATCH] Swag example readme section update with gradient accumulation run.

---
 README.md | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 23cd315c29..0ccf96c42c 100644
--- a/README.md
+++ b/README.md
@@ -441,25 +441,22 @@ python run_swag.py \
   --do_train \
   --do_eval \
   --data_dir $SWAG_DIR/data
-  --train_batch_size 4 \
+  --train_batch_size 16 \
   --learning_rate 2e-5 \
   --num_train_epochs 3.0 \
   --max_seq_length 80 \
   --output_dir /tmp/swag_output/
+  --gradient_accumulation_steps 4
 ```
 
 Training with the previous hyper-parameters gave us the following results:
 ```
-eval_accuracy = 0.7776167149855043
-eval_loss = 1.006812262735175
-global_step = 55161
-loss = 0.282251750624779
+eval_accuracy = 0.8062081375587323
+eval_loss = 0.5966546792367169
+global_step = 13788
+loss = 0.06423990014260186
 ```
 
-The difference with the `81.6%` accuracy announced in the Bert article
-is probably due to the different `training_batch_size` (here 4 and 16
-in the article).
-
 ## Fine-tuning BERT-large on GPUs
 
 The options we list above allow to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.
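
The change in `global_step` between the two runs (55161 before, 13788 after) can be checked with a little arithmetic. The sketch below is not taken from `run_swag.py`; it assumes that the script splits `--train_batch_size` into micro-batches of `train_batch_size / gradient_accumulation_steps` examples, performs one optimizer update every `gradient_accumulation_steps` micro-batches, and drops an incomplete group at the end of each epoch. The helper name `optimizer_steps` is hypothetical.

```python
def optimizer_steps(micro_batches_per_epoch, accumulation_steps, epochs):
    """Count optimizer updates when one update happens every
    `accumulation_steps` micro-batches and leftovers are dropped each epoch
    (an assumption about the training loop, not confirmed by the patch)."""
    return (micro_batches_per_epoch // accumulation_steps) * epochs


# Old run from the patch: batch size 4, no accumulation, 3 epochs.
old_global_step = 55161
epochs = 3
micro_batches_per_epoch = old_global_step // epochs

# New run: --train_batch_size 16 with --gradient_accumulation_steps 4,
# assumed to mean micro-batches of 16 // 4 = 4 examples, as in the old run.
accumulation_steps = 4
micro_batch_size = 16 // accumulation_steps
new_global_step = optimizer_steps(micro_batches_per_epoch, accumulation_steps, epochs)

print(micro_batch_size, micro_batches_per_epoch, new_global_step)
```

Under these assumptions the result is 13788, the `global_step` reported in the updated results: each forward pass still sees 4 examples, but the weights are updated a quarter as often, on gradients accumulated over 16 examples.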