Updated outdated examples
@@ -12,7 +12,7 @@ similar API between the different models.
 
 ## Language model fine-tuning
 
-Based on the script `run_lm_finetuning.py`.
+Based on the script [`run_lm_finetuning.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_lm_finetuning.py).
 
 Fine-tuning the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
 to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
@@ -52,8 +52,8 @@ The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using th
 as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
 pre-training: masked language modeling.
 
-In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may therefore converge
-slower, but over-fitting would take more epochs.
+In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
+slightly slower (over-fitting takes more epochs).
 
 We use the `--mlm` flag so that the script may change its loss function.
 
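
The dynamic masking mentioned in the hunk above can be illustrated with a small sketch. This is not the code from `run_lm_finetuning.py`; it is a minimal, standard-library illustration that assumes the usual masked-language-modeling recipe (select roughly 15% of positions, then replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged). The function name `dynamic_mask`, the `IGNORE_INDEX` sentinel, and the toy token ids are all hypothetical.

```python
import random

# Hypothetical sentinel for positions that do not contribute to the loss.
IGNORE_INDEX = -100


def dynamic_mask(token_ids, mask_id, vocab_size, rng, mlm_probability=0.15):
    """Return (inputs, labels) with a fresh random mask drawn on every call."""
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position left untouched and ignored by the loss
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


rng = random.Random(0)
sentence = list(range(100, 140))  # toy token ids, assumed for illustration

# Calling the function once per epoch draws a new mask each time,
# whereas static masking would compute one mask up front and reuse it.
epoch_1 = dynamic_mask(sentence, mask_id=4, vocab_size=1000, rng=rng)
epoch_2 = dynamic_mask(sentence, mask_id=4, vocab_size=1000, rng=rng)
```

Because the mask is redrawn on every pass, the model rarely sees the same corrupted sequence twice, which is consistent with the slower convergence and delayed over-fitting described above.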
@@ -74,6 +74,8 @@ python run_lm_finetuning.py \
 
 ## Language generation
 
+Based on the script [`run_generation.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_generation.py).
+
 Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet.
 A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
 can try out the different models available in the library.
@@ -88,6 +90,8 @@ python run_generation.py \
 
 ## GLUE
 
+Based on the script [`run_glue.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_glue.py).
+
 Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
 Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
 
@@ -120,13 +124,14 @@ and unpack it to some directory `$GLUE_DIR`.
 export GLUE_DIR=/path/to/glue
 export TASK_NAME=MRPC
 
-python run_bert_classifier.py \
+python run_glue.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
   --task_name $TASK_NAME \
   --do_train \
   --do_eval \
   --do_lower_case \
   --data_dir $GLUE_DIR/$TASK_NAME \
-  --bert_model bert-base-uncased \
   --max_seq_length 128 \
   --train_batch_size 32 \
   --learning_rate 2e-5 \
@@ -160,13 +165,14 @@ and unpack it to some directory `$GLUE_DIR`.
 ```bash
 export GLUE_DIR=/path/to/glue
 
-python run_bert_classifier.py \
+python run_glue.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
   --task_name MRPC \
   --do_train \
   --do_eval \
   --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
-  --bert_model bert-base-uncased \
   --max_seq_length 128 \
   --train_batch_size 32 \
   --learning_rate 2e-5 \
@@ -186,13 +192,14 @@ Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds.
 ```bash
 export GLUE_DIR=/path/to/glue
 
-python run_bert_classifier.py \
+python run_glue.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
   --task_name MRPC \
   --do_train \
   --do_eval \
   --do_lower_case \
   --data_dir $GLUE_DIR/MRPC/ \
-  --bert_model bert-base-uncased \
   --max_seq_length 128 \
   --train_batch_size 32 \
   --learning_rate 2e-5 \
@@ -210,8 +217,9 @@ reaches F1 > 92 on MRPC.
 export GLUE_DIR=/path/to/glue
 
 python -m torch.distributed.launch \
-  --nproc_per_node 8 run_bert_classifier.py \
-  --bert_model bert-large-uncased-whole-word-masking \
+  --nproc_per_node 8 run_glue.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
   --task_name MRPC \
   --do_train \
   --do_eval \
@@ -221,7 +229,7 @@ python -m torch.distributed.launch \
   --train_batch_size 8 \
   --learning_rate 2e-5 \
   --num_train_epochs 3.0 \
   --output_dir /tmp/mrpc_output/
 ```
 
 Training with these hyper-parameters gave us the following results:
@@ -243,8 +251,9 @@ The following example uses the BERT-large, uncased, whole-word-masking model and
 export GLUE_DIR=/path/to/glue
 
 python -m torch.distributed.launch \
-  --nproc_per_node 8 run_bert_classifier.py \
-  --bert_model bert-large-uncased-whole-word-masking \
+  --nproc_per_node 8 run_glue.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
   --task_name mnli \
   --do_train \
   --do_eval \
@@ -275,6 +284,8 @@ The results are the following:
 
 ## SQuAD
 
+Based on the script [`run_squad.py`](https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_squad.py).
+
 #### Fine-tuning on SQuAD
 
 This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
@@ -288,8 +299,9 @@ $SQUAD_DIR directory.
 ```bash
 export SQUAD_DIR=/path/to/SQUAD
 
-python run_bert_squad.py \
-  --bert_model bert-base-uncased \
+python run_squad.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
   --do_train \
   --do_predict \
   --do_lower_case \
@@ -316,9 +328,9 @@ exact_match = 81.22
 Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD:
 
 ```bash
-python -m torch.distributed.launch --nproc_per_node=8 \
-  run_bert_squad.py \
-  --bert_model bert-large-uncased-whole-word-masking \
+python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
+  --model_type bert \
+  --model_name_or_path bert-base-cased \
   --do_train \
   --do_predict \
   --do_lower_case \