Swag example readme section update with gradient accumulation run.
15
README.md
@@ -441,25 +441,22 @@ python run_swag.py \
 --do_train \
 --do_eval \
 --data_dir $SWAG_DIR/data
---train_batch_size 4 \
+--train_batch_size 16 \
 --learning_rate 2e-5 \
 --num_train_epochs 3.0 \
 --max_seq_length 80 \
 --output_dir /tmp/swag_output/
+--gradient_accumulation_steps 4
 ```
 
 Training with the previous hyper-parameters gave us the following results:
 ```
-eval_accuracy = 0.7776167149855043
-eval_loss = 1.006812262735175
-global_step = 55161
-loss = 0.282251750624779
+eval_accuracy = 0.8062081375587323
+eval_loss = 0.5966546792367169
+global_step = 13788
+loss = 0.06423990014260186
 ```
 
-The difference with the `81.6%` accuracy announced in the Bert article
-is probably due to the different `training_batch_size` (here 4 and 16
-in the article).
-
 ## Fine-tuning BERT-large on GPUs
 
 The options we list above allow to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.
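
The drop in `global_step` from 55161 to 13788 is consistent with the new flags: a batch size of 16 with 4 accumulation steps still feeds 4 examples per forward pass, but updates the weights only once every 4 passes. The sketch below reproduces both numbers under stated assumptions, none of which appear in the diff itself: that the SWAG training split has 73,546 examples, that the script divides `train_batch_size` by `gradient_accumulation_steps` to get the per-pass batch, and that leftover passes at the end of an epoch do not trigger an update. The function name `optimizer_steps` is hypothetical.

```python
import math

# Assumption: size of the SWAG training split (not stated in the diff).
NUM_TRAIN_EXAMPLES = 73_546
# From the command in the diff: --num_train_epochs 3.0
NUM_EPOCHS = 3


def optimizer_steps(train_batch_size: int, accumulation_steps: int) -> int:
    """Count weight updates over all epochs (hypothetical helper).

    Assumes each forward/backward pass sees
    train_batch_size // accumulation_steps examples, and that the optimizer
    steps once every `accumulation_steps` passes, with any leftover passes
    at the end of an epoch not producing an update.
    """
    micro_batch_size = train_batch_size // accumulation_steps
    passes_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / micro_batch_size)
    updates_per_epoch = passes_per_epoch // accumulation_steps
    return updates_per_epoch * NUM_EPOCHS


# Old run: --train_batch_size 4, no accumulation.
old_run = optimizer_steps(train_batch_size=4, accumulation_steps=1)
# New run: --train_batch_size 16 --gradient_accumulation_steps 4.
new_run = optimizer_steps(train_batch_size=16, accumulation_steps=4)

print(old_run, new_run)  # 55161 13788
```

Under these assumptions the memory footprint per pass is unchanged between the two runs, while each weight update in the new run averages over four times as many examples.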