Update the README of the text classification example (#9237)
* Update the README of the text classification example
* Update examples/README.md
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Adapt comment from review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@@ -69,6 +69,43 @@ Coming soon!
**Coming soon!**
-->

## Distributed training and mixed precision

All the PyTorch scripts mentioned above work out of the box with distributed training and mixed precision, thanks to
the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html). To launch one of them on _n_ GPUs,
use the following command:

```bash
python -m torch.distributed.launch \
    --nproc_per_node number_of_gpu_you_have path_to_script.py \
    --all_arguments_of_the_script
```

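If you launch jobs from Python rather than from a shell, the same command can be assembled as an argument list. The following is a minimal sketch under stated assumptions: `build_launch_command` is a hypothetical helper (it is not part of `transformers` or PyTorch), and it only builds the argument list without running anything.

```python
import shlex
import sys
from typing import List, Sequence


def build_launch_command(n_gpus: int, script: str, script_args: Sequence[str]) -> List[str]:
    """Assemble a `torch.distributed.launch` invocation as an argument list.

    Hypothetical helper for illustration only; nothing is executed here.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nproc_per_node", str(n_gpus),
        script,
        *script_args,
    ]


cmd = build_launch_command(
    8, "text-classification/run_glue.py", ["--task_name", "mnli", "--do_train"]
)
# Print a shell-quoted view of everything after the interpreter path.
print(" ".join(shlex.quote(part) for part in cmd[1:]))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` when you actually want to start training.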
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 GPUs:

```bash
python -m torch.distributed.launch \
    --nproc_per_node 8 text-classification/run_glue.py \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --task_name mnli \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mnli_output/
```

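Note that `--per_device_train_batch_size` applies to each process, so the 8-GPU command trains with a larger effective batch than the flag alone suggests. A quick sketch of the arithmetic (the helper name is illustrative, and gradient accumulation is assumed to stay at 1 unless passed explicitly):

```python
def effective_batch_size(per_device: int, n_devices: int, grad_accum_steps: int = 1) -> int:
    """Number of examples contributing to each optimizer step in data-parallel training."""
    return per_device * n_devices * grad_accum_steps


# Values from the MNLI command: 8 examples per GPU on 8 GPUs.
print(effective_batch_size(per_device=8, n_devices=8))  # 64
```

If you change the number of GPUs, you may want to adjust the per-device value so that the product stays the same.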
If you have a GPU with mixed precision capabilities (architecture Pascal or more recent), you can use mixed precision
training with PyTorch 1.6.0 or later, or by installing the [Apex](https://github.com/NVIDIA/apex) library for earlier
versions. Just add the flag `--fp16` to the command launching one of the scripts mentioned above!

Using mixed precision training usually yields about a 2x training speedup with nearly identical final results (as shown in
[this table](https://github.com/huggingface/transformers/tree/master/examples/text-classification#mixed-precision-training)
for text classification).

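The version rule for mixed precision can be written down as a small check. This is a sketch only: `mixed_precision_backend` and its return labels are hypothetical, and it assumes a plain `major.minor.patch` version string, possibly followed by a local suffix such as `+cu101`. It encodes nothing beyond the statement that native mixed precision needs PyTorch 1.6.0 or later, and that earlier versions need Apex.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.6.0+cu101' into (1, 6, 0); assumes purely numeric dotted components."""
    core = version.split("+")[0]
    return tuple(int(part) for part in core.split(".")[:3])


def mixed_precision_backend(torch_version: str) -> str:
    """Return 'native' for PyTorch >= 1.6.0, otherwise 'apex' (illustrative labels)."""
    return "native" if parse_version(torch_version) >= (1, 6, 0) else "apex"


for v in ("1.5.1", "1.6.0", "1.7.1+cu101"):
    print(v, "->", mixed_precision_backend(v))
```

In a real environment you would pass `torch.__version__` instead of a literal string.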
## Running on TPUs

When using TensorFlow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
@@ -76,27 +113,34 @@ When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Str
When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to set up your TPU environment, refer to Google's documentation and to the
very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).

-In this repo, we provide a very simple launcher script named [xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
-Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).
-Note that this approach does not work for examples that use `pytorch-lightning`.
-
-For example for `run_glue`:
+In this repo, we provide a very simple launcher script named
+[xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our
+example scripts on multiple TPU cores without any boilerplate. Just pass a `--num_cores` flag to this script, then your
+regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for
+`torch.distributed`):
 
 ```bash
-python examples/xla_spawn.py --num_cores 8 \
-    examples/text-classification/run_glue.py \
-    --model_name_or_path bert-base-cased \
-    --task_name mnli \
-    --data_dir ./data/glue_data/MNLI \
-    --output_dir ./models/tpu \
-    --overwrite_output_dir \
-    --do_train \
-    --do_eval \
-    --num_train_epochs 1 \
-    --save_steps 20000
+python xla_spawn.py --num_cores num_tpu_you_have \
+    path_to_script.py \
+    --all_arguments_of_the_script
 ```
 
-Feedback and more use cases and benchmarks involving TPUs are welcome, please share with the community.
As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
classification MNLI task using the `run_glue` script, with 8 TPU cores:

```bash
python xla_spawn.py --num_cores 8 \
    text-classification/run_glue.py \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --task_name mnli \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mnli_output/
```

## Logging & Experiment tracking

@@ -14,7 +14,76 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

## GLUE Benchmark
|
# Text classification examples
|

## PyTorch version

Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).

Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
Evaluation](https://gluebenchmark.com/). This script can fine-tune any of the models on the [hub](https://huggingface.co/models)
and can also be used for your own data in a CSV or a JSON file (the script might need some tweaks in that case; refer
to the comments inside for help).

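For your own data, a file along the following lines is a plausible starting point. This is a sketch under assumptions that this README does not state: the column names `sentence1`, `sentence2` and `label`, and the file name `train.csv`, are illustrative only, so check the comments inside `run_glue.py` for what the script actually expects.

```python
import csv

# Illustrative sentence-pair rows; the column names are assumptions, not a documented schema.
rows = [
    {"sentence1": "The cat sat on the mat.", "sentence2": "A cat is on a mat.", "label": 1},
    {"sentence1": "He plays the violin.", "sentence2": "She is cooking dinner.", "label": 0},
]

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["sentence1", "sentence2", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm that it round-trips.
with open("train.csv", newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(len(loaded), loaded[0]["label"])
```

Keep in mind that the `csv` module reads every value back as a string, so labels may need converting before use.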
GLUE is made up of a total of 9 different tasks. Here is how to run the script on one of them:

```bash
export TASK_NAME=mrpc

python run_glue.py \
    --model_name_or_path bert-base-cased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 3 \
    --output_dir /tmp/$TASK_NAME/
```

where task name can be one of cola, sst2, mrpc, stsb, qqp, mnli, qnli, rte, wnli.

We get the following results on the dev set of the benchmark with the previous commands (with an exception for MRPC and
WNLI, which are tiny and where we used 5 epochs instead of 3). Trainings are seeded, so you should obtain the same
results with PyTorch 1.6.0 (and close results with different versions). Training times are given for information (a
single Titan RTX was used):

| Task  | Metric                       | Result      | Training time |
|-------|------------------------------|-------------|---------------|
| CoLA  | Matthews corr.               | 56.53       | 3:17          |
| SST-2 | Accuracy                     | 92.32       | 26:06         |
| MRPC  | F1/Accuracy                  | 88.85/84.07 | 2:21          |
| STS-B | Pearson/Spearman corr.       | 88.64/88.48 | 2:13          |
| QQP   | Accuracy/F1                  | 90.71/87.49 | 2:22:26       |
| MNLI  | Matched acc./Mismatched acc. | 83.91/84.10 | 2:35:23       |
| QNLI  | Accuracy                     | 90.66       | 40:57         |
| RTE   | Accuracy                     | 65.70       | 0:57          |
| WNLI  | Accuracy                     | 56.34       | 0:24          |

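The runs behind this table can be scripted by looping over the task names. Here is a minimal sketch, assuming the hyperparameters of the `run_glue.py` command shown earlier and the 5-epoch exception for MRPC and WNLI; `glue_command` is a hypothetical helper that only builds argument lists and does not launch anything.

```python
from typing import List

TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli"]
SMALL_TASKS = {"mrpc", "wnli"}  # tiny datasets, trained for 5 epochs instead of 3


def glue_command(task: str, model: str = "bert-base-cased") -> List[str]:
    """Build the run_glue.py argument list for one task (illustrative helper)."""
    if task not in TASKS:
        raise ValueError(f"unknown GLUE task: {task}")
    epochs = 5 if task in SMALL_TASKS else 3
    return [
        "python", "run_glue.py",
        "--model_name_or_path", model,
        "--task_name", task,
        "--do_train",
        "--do_eval",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", str(epochs),
        "--output_dir", f"/tmp/{task}/",
    ]


for task in TASKS:
    print(" ".join(glue_command(task)))
```

Building the commands as lists keeps each run easy to log and to pass to `subprocess.run` one at a time.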
Some of these results are significantly different from the ones reported on the test set of the GLUE benchmark on the
website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.

### Mixed precision training

If you have a GPU with mixed precision capabilities (architecture Pascal or more recent), you can use mixed precision
training with PyTorch 1.6.0 or later, or by installing the [Apex](https://github.com/NVIDIA/apex) library for earlier
versions. Just add the flag `--fp16` to the command launching one of the scripts mentioned above!

Using mixed precision training usually yields about a 2x training speedup with nearly identical final results:

| Task  | Metric                       | Result      | Training time | Result (FP16) | Training time (FP16) |
|-------|------------------------------|-------------|---------------|---------------|----------------------|
| CoLA  | Matthews corr.               | 56.53       | 3:17          | 56.78         | 1:41                 |
| SST-2 | Accuracy                     | 92.32       | 26:06         | 91.74         | 13:11                |
| MRPC  | F1/Accuracy                  | 88.85/84.07 | 2:21          | 88.12/83.58   | 1:10                 |
| STS-B | Pearson/Spearman corr.       | 88.64/88.48 | 2:13          | 88.71/88.55   | 1:08                 |
| QQP   | Accuracy/F1                  | 90.71/87.49 | 2:22:26       | 90.67/87.43   | 1:11:54              |
| MNLI  | Matched acc./Mismatched acc. | 83.91/84.10 | 2:35:23       | 84.04/84.06   | 1:17:06              |
| QNLI  | Accuracy                     | 90.66       | 40:57         | 90.96         | 20:16                |
| RTE   | Accuracy                     | 65.70       | 0:57          | 65.34         | 0:29                 |
| WNLI  | Accuracy                     | 56.34       | 0:24          | 56.34         | 0:12                 |

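To see how close the measured gain is to 2x, the training times can be converted to seconds and divided. Below is a small sketch using four rows of the FP16 comparison table; it assumes times are written as `h:mm:ss` or `m:ss`, and that a bare number means seconds. The function name is illustrative.

```python
def to_seconds(duration: str) -> int:
    """Parse 'h:mm:ss', 'm:ss' or a bare number of seconds into seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# (FP32 time, FP16 time) pairs copied from the comparison table.
TIMES = {
    "CoLA": ("3:17", "1:41"),
    "SST-2": ("26:06", "13:11"),
    "QQP": ("2:22:26", "1:11:54"),
    "MNLI": ("2:35:23", "1:17:06"),
}

for task, (fp32, fp16) in TIMES.items():
    speedup = to_seconds(fp32) / to_seconds(fp16)
    print(f"{task}: {speedup:.2f}x")
```

For these four tasks the ratio lands between roughly 1.95x and 2.02x, which is consistent with the "about 2x" claim.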
# Run TensorFlow 2.0 version

@@ -65,191 +134,8 @@ python run_tf_text_classification.py \
    --max_seq_length 128
```

# Run PyTorch version
|
|
||||||
|
|
||||||
Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
|
## XNLI
|
||||||
|
|
||||||
Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
|
|
||||||
Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
|
|
||||||
|
|
||||||
GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
|
|
||||||
uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on single V100 GPUs with total train
|
|
||||||
batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
|
|
||||||
between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
|
|
||||||
|
|
||||||
| Task | Metric | Result |
|
|
||||||
|-------|------------------------------|-------------|
|
|
||||||
| CoLA | Matthew's corr | 49.23 |
|
|
||||||
| SST-2 | Accuracy | 91.97 |
|
|
||||||
| MRPC | F1/Accuracy | 89.47/85.29 |
|
|
||||||
| STS-B | Pearson/Spearman corr. | 83.95/83.70 |
|
|
||||||
| QQP | Accuracy/F1 | 88.40/84.31 |
|
|
||||||
| MNLI | Matched acc./Mismatched acc. | 80.61/81.08 |
|
|
||||||
| QNLI | Accuracy | 87.46 |
|
|
||||||
| RTE | Accuracy | 61.73 |
|
|
||||||
| WNLI | Accuracy | 45.07 |
|
|
||||||
|
|
||||||
Some of these results are significantly different from the ones reported on the test set
|
|
||||||
of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the
|
|
||||||
website.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export TASK_NAME=MRPC
|
|
||||||
|
|
||||||
python run_glue.py \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name $TASK_NAME \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_device_train_batch_size 32 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir /tmp/$TASK_NAME/
|
|
||||||
```
|
|
||||||
|
|
||||||
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
|
|
||||||
|
|
||||||
The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
|
|
||||||
In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
|
|
||||||
output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
|
|
||||||
|
|
||||||
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
|
|
||||||
CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
|
|
||||||
said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
|
|
||||||
since the data processor for each task inherits from the base class DataProcessor.
|
|
||||||
|
|
||||||
## Running on TPUs in PyTorch
|
|
||||||
|
|
||||||
Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on
|
|
||||||
how to set up your TPU environment, refer to the
|
|
||||||
[pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
|
|
||||||
|
|
||||||
To run a GLUE task such as MRPC, you can run something like the following from the root of the transformers
|
|
||||||
repo:
|
|
||||||
|
|
||||||
```
|
|
||||||
python examples/xla_spawn.py \
|
|
||||||
--num_cores=8 \
|
|
||||||
transformers/examples/text-classification/run_glue.py \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--task_name=mrpc \
|
|
||||||
--num_train_epochs=3 \
|
|
||||||
--max_seq_length=128 \
|
|
||||||
--learning_rate=5e-5 \
|
|
||||||
--output_dir=/tmp/mrpc \
|
|
||||||
--overwrite_output_dir \
|
|
||||||
--logging_steps=5 \
|
|
||||||
--save_steps=5 \
|
|
||||||
--tpu_metrics_debug \
|
|
||||||
--model_name_or_path=bert-base-cased \
|
|
||||||
--per_device_train_batch_size=64 \
|
|
||||||
--per_device_eval_batch_size=64
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
#### Using Apex and mixed-precision
|
|
||||||
|
|
||||||
Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
|
|
||||||
[apex](https://github.com/NVIDIA/apex), then run the following example:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
|
|
||||||
python run_glue.py \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name MRPC \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_device_train_batch_size 32 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir /tmp/mrpc_output/ \
|
|
||||||
--fp16
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Distributed training
|
|
||||||
|
|
||||||
Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it
|
|
||||||
reaches F1 > 92 on MRPC.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
|
|
||||||
python -m torch.distributed.launch \
|
|
||||||
--nproc_per_node 8 run_glue.py \
|
|
||||||
--model_name_or_path bert-base-cased \
|
|
||||||
--task_name mrpc \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_device_train_batch_size 8 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir /tmp/mrpc_output/
|
|
||||||
```
|
|
||||||
|
|
||||||
Training with these hyper-parameters gave us the following results:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
acc = 0.8823529411764706
|
|
||||||
acc_and_f1 = 0.901702786377709
|
|
||||||
eval_loss = 0.3418912578906332
|
|
||||||
f1 = 0.9210526315789473
|
|
||||||
global_step = 174
|
|
||||||
loss = 0.07231863956341798
|
|
||||||
```
|
|
||||||
|
|
||||||
### MNLI
|
|
||||||
|
|
||||||
The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export GLUE_DIR=/path/to/glue
|
|
||||||
|
|
||||||
python -m torch.distributed.launch \
|
|
||||||
--nproc_per_node 8 run_glue.py \
|
|
||||||
--model_name_or_path bert-large-uncased-whole-word-masking \
|
|
||||||
--task_name mnli \
|
|
||||||
--do_train \
|
|
||||||
--do_eval \
|
|
||||||
--max_seq_length 128 \
|
|
||||||
--per_device_train_batch_size 8 \
|
|
||||||
--learning_rate 2e-5 \
|
|
||||||
--num_train_epochs 3.0 \
|
|
||||||
--output_dir output_dir
|
|
||||||
```
|
|
||||||
|
|
||||||
The results are the following:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
***** Eval results *****
|
|
||||||
acc = 0.8679706601466992
|
|
||||||
eval_loss = 0.4911287787382479
|
|
||||||
global_step = 18408
|
|
||||||
loss = 0.04755385363816904
|
|
||||||
|
|
||||||
***** Eval results *****
|
|
||||||
acc = 0.8747965825874695
|
|
||||||
eval_loss = 0.45516540421714036
|
|
||||||
global_step = 18408
|
|
||||||
loss = 0.04755385363816904
|
|
||||||
```
|
|
||||||
|
|
||||||
# Run PyTorch version using PyTorch-Lightning
|
|
||||||
|
|
||||||
Run `bash run_pl.sh` from the `glue` directory. This will also install `pytorch-lightning` and the requirements in
|
|
||||||
`examples/requirements.txt`. It is a shell pipeline that will automatically download, preprocess the data and run the
|
|
||||||
specified models. Logs are saved in `lightning_logs` directory.
|
|
||||||
|
|
||||||
Pass `--gpus` flag to change the number of GPUs. Default uses 1. At the end, the expected results are:
|
|
||||||
|
|
||||||
```
|
|
||||||
TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
# XNLI
|
|
||||||
|
|
||||||
Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).