Merge branch 'master' into RoBERTa

2019-08-08 09:42:05 -04:00
parent 39d72bcc7b 7729ef7381
commit d2cc6b101e
62 changed files with 2082 additions and 1387 deletions
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@ These implementations have been tested on several datasets (see the example scri

 ## Installation

-This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1 to 1.1.0
+This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.0.0+

 ### With pip

@@ -130,6 +130,7 @@ tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
 ```

 ## Quick tour of the fine-tuning/usage scripts
+
 The library comprises several example scripts with SOTA performances for NLU and NLG tasks:

 - `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
@@ -314,7 +315,7 @@ loss = outputs[0]
 # In pytorch-transformers you can also have access to the logits:
 loss, logits = outputs[:2]

-# And even the attention weigths if you configure the model to output them (and other outputs too, see the docstrings and documentation)
+# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
 model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
 outputs = model(input_ids, labels=labels)
 loss, logits, attentions = outputs
@@ -355,8 +356,13 @@ tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')

 ### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules

-The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
-The new optimizer `AdamW` matches PyTorch `Adam` optimizer API.
+The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences:
+
+- it only implements weights decay correction,
+- schedules are now externals (see below),
+- gradient clipping is now also external (see below).
+
+The new optimizer `AdamW` matches PyTorch `Adam` optimizer API and let you use standard PyTorch or apex methods for the schedule and clipping.

 The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.

@@ -365,6 +371,7 @@ Here is a conversion examples from `BertAdam` with a linear warmup and decay sch
 ```python
 # Parameters:
 lr = 1e-3
+max_grad_norm = 1.0
 num_total_steps = 1000
 num_warmup_steps = 100
 warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1
@@ -384,8 +391,10 @@ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_tot
 for batch in train_data:
    loss = model(batch)
    loss.backward()
+    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    scheduler.step()
    optimizer.step()
+    optimizer.zero_grad()
 ```

 ## Citation