Fix TPU Convergence bug introduced by PR#6151 (#6488)

PR #6151 introduced a bug that makes the Trainer take two optimizer
steps per batch on TPU: a global one, where `xm.optimizer_step` injects
a CRS (cross-replica sum) across all cores in training, and a second
one without it. This has been hurting training accuracy (for example,
XLNet GLUE on MNLI does not converge).
Jin Young (Daniel) Sohn
2020-08-14 09:47:37 -07:00
committed by GitHub
parent 895ed8f451
commit 24107c2c83

@@ -572,7 +572,7 @@ class Trainer:
 if is_torch_tpu_available():
     xm.optimizer_step(self.optimizer)
-if self.args.fp16 and _use_native_amp:
+elif self.args.fp16 and _use_native_amp:
     self.scaler.step(self.optimizer)
     self.scaler.update()
 else:
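
The effect of changing `if` to `elif` can be illustrated with a minimal, torch-free sketch. Everything here is hypothetical and for illustration only: `CountingOptimizer`, `step_before_fix`, `step_after_fix`, and `count_steps` are not part of the Trainer, and each `opt.step()` call merely stands in for the real call named in the adjacent comment.

```python
class CountingOptimizer:
    """Hypothetical stand-in for an optimizer; it only counts step() calls."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


def step_before_fix(opt, on_tpu, use_amp):
    # Two independent statements: taking the TPU branch does not
    # prevent the following if/else from running as well.
    if on_tpu:
        opt.step()  # stands in for xm.optimizer_step(self.optimizer)
    if use_amp:
        opt.step()  # stands in for self.scaler.step(self.optimizer)
    else:
        opt.step()  # stands in for the plain optimizer step


def step_after_fix(opt, on_tpu, use_amp):
    # A single if/elif/else chain: exactly one branch runs per batch.
    if on_tpu:
        opt.step()
    elif use_amp:
        opt.step()
    else:
        opt.step()


def count_steps(step_fn, on_tpu, use_amp):
    """Run one simulated batch and report how many optimizer steps were taken."""
    opt = CountingOptimizer()
    step_fn(opt, on_tpu, use_amp)
    return opt.steps


if __name__ == "__main__":
    print("before fix, TPU:", count_steps(step_before_fix, True, False))
    print("after fix,  TPU:", count_steps(step_after_fix, True, False))
```

Under these assumptions, the pre-fix control flow takes two steps per batch whenever the TPU branch is active, while the post-fix chain takes one; off TPU, both versions behave the same.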