From 7cc35c31040d8bdfcadc274c087d6a73c2036210 Mon Sep 17 00:00:00 2001
From: thomwolf <thomwolf@gmail.com>
Date: Wed, 6 Mar 2019 11:43:21 +0100
Subject: [PATCH] fix openai gpt example and updating readme

---
 README.md                  | 3 ++-
 examples/run_openai_gpt.py | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b298938ca6..4aa2f649d7 100644
--- a/README.md
+++ b/README.md
@@ -899,6 +899,7 @@ The differences with `BertAdam` is that `OpenAIGPTAdam` compensate for bias as i
 |-|-|
 | [Training large models: introduction, tools and examples](#Training-large-models-introduction,-tools-and-examples) | How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models |
 | [Fine-tuning with BERT: running the examples](#Fine-tuning-with-BERT-running-the-examples) | Running the examples in [`./examples`](./examples/): `extract_classif.py`, `run_classifier.py`, `run_squad.py` and `run_lm_finetuning.py` |
+| [Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2](#Fine-tuning-with-OpenAI-GPT-Transformer-XL-and-GPT-2) | Running the examples in [`./examples`](./examples/): `run_openai_gpt.py`, `run_transfo_xl.py` and `run_gpt2.py` |
 | [Fine-tuning BERT-large on GPUs](#Fine-tuning-BERT-large-on-GPUs) | How to fine tune `BERT large`|
 
 ### Training large models: introduction, tools and examples
@@ -1091,7 +1092,7 @@ python run_openai_gpt.py \
   --train_batch_size 16 \
 ```
 
-This command runs in about 10 min on a single K-80 an gives an evaluation accuracy of about 86.4% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%).
+This command runs in about 10 min on a single K-80 an gives an evaluation accuracy of about 87.7% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%).
 
 #### Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset
 
diff --git a/examples/run_openai_gpt.py b/examples/run_openai_gpt.py
index 97e77e720b..ee30a7a0a4 100644
--- a/examples/run_openai_gpt.py
+++ b/examples/run_openai_gpt.py
@@ -210,6 +210,7 @@ def main():
                 loss = args.lm_coef * losses[0] + losses[1]
                 loss.backward()
                 optimizer.step()
+                optimizer.zero_grad()
                 tr_loss += loss.item()
                 exp_average_loss = loss.item() if exp_average_loss is None else 0.7*exp_average_loss+0.3*loss.item()
                 nb_tr_steps += 1