Merge branch 'master' into RoBERTa
17
README.md
@@ -26,7 +26,7 @@ These implementations have been tested on several datasets (see the example scri

## Installation

This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1 to 1.1.0
This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.0.0+

### With pip

@@ -130,6 +130,7 @@ tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
```

## Quick tour of the fine-tuning/usage scripts

The library comprises several example scripts with SOTA performances for NLU and NLG tasks:

- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
@@ -314,7 +315,7 @@ loss = outputs[0]
# In pytorch-transformers you can also have access to the logits:
loss, logits = outputs[:2]

# And even the attention weigths if you configure the model to output them (and other outputs too, see the docstrings and documentation)
# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs
@@ -355,8 +356,13 @@ tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')

### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules

The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
The new optimizer `AdamW` matches PyTorch `Adam` optimizer API.
The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences:

- it only implements weight decay correction,
- schedules are now external (see below),
- gradient clipping is now also external (see below).

The new optimizer `AdamW` matches PyTorch `Adam` optimizer API and lets you use standard PyTorch or apex methods for the schedule and clipping.

The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.

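Since gradient clipping now happens outside the optimizer, it may help to see what clipping by global norm amounts to. The following is a minimal pure-Python sketch under stated assumptions, not the PyTorch implementation: `clip_by_global_norm` is a hypothetical helper name, and it works on plain lists of floats rather than tensors.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradients so their combined L2 norm is at most max_norm.

    `grads` is a list of per-parameter gradients, each a list of floats
    (an assumption made for illustration; real code would use tensors).
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = max_norm / (total_norm + eps)
    if scale >= 1.0:
        # Already within the limit: return a copy, unchanged.
        return [list(grad) for grad in grads], total_norm
    return [[g * scale for g in grad] for grad in grads], total_norm


# Two one-element "parameters" whose combined norm is 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
```

Every gradient is multiplied by the same factor, so the update direction is preserved and only its length is capped.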
@@ -365,6 +371,7 @@ Here is a conversion examples from `BertAdam` with a linear warmup and decay sch
```python
# Parameters:
lr = 1e-3
max_grad_norm = 1.0
num_total_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1
@@ -384,8 +391,10 @@ scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_tot
for batch in train_data:
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    scheduler.step()
    optimizer.step()
    optimizer.zero_grad()
```
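
To make the linear warmup and decay schedule in the example above more concrete, here is a small pure-Python sketch of the learning-rate multiplier such a schedule might apply at each step. The function name `linear_warmup_decay` is hypothetical and this is an illustration of the idea, not the library's own code.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier that ramps linearly from 0 to 1, then decays linearly to 0."""
    if step < warmup_steps:
        return float(step) / float(max(1, warmup_steps))
    remaining = float(total_steps - step)
    return max(0.0, remaining / float(max(1, total_steps - warmup_steps)))


# Same values as the parameters in the example above.
lr = 1e-3
num_total_steps = 1000
num_warmup_steps = 100

for step in (0, 50, 100, 550, 1000):
    multiplier = linear_warmup_decay(step, num_warmup_steps, num_total_steps)
    print(step, lr * multiplier)
```

A function of this shape could, for instance, be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned value.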

## Citation
