Merge branch 'master' into master
@@ -5,11 +5,36 @@ similar API between the different models.

| Section                    | Description                                                                                                                                                |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
| [Language Model fine-tuning](#language-model-fine-tuning) | Fine-tuning the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
| [Named Entity Recognition](#named-entity-recognition) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
| [Abstractive summarization](#abstractive-summarization) | Fine-tuning the library models for abstractive summarization tasks on the CNN/Daily Mail dataset. |

## TensorFlow 2.0 Bert models on GLUE

Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).

Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).

This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware, and an option for XLA, which uses the XLA compiler to reduce model runtime.
Options are toggled using the `USE_XLA` and `USE_AMP` variables in the script.
These options and the benchmark below are provided by @tlkh.

Quick benchmarks from the script (no other modifications):

| GPU     | Mode | Time (2nd epoch) | Val Acc (3 runs)     |
| ------- | ---- | ---------------- | -------------------- |
| Titan V | FP32 | 41s              | 0.8438/0.8281/0.8333 |
| Titan V | AMP  | 26s              | 0.8281/0.8568/0.8411 |
| V100    | FP32 | 35s              | 0.8646/0.8359/0.8464 |
| V100    | AMP  | 22s              | 0.8646/0.8385/0.8411 |
| 1080 Ti | FP32 | 55s              | -                    |

Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (the same batch size was used): the second epoch takes roughly 37% less time on both the Titan V and the V100.

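The effect of AMP can be read off the table by dividing the FP32 epoch time by the AMP epoch time. A minimal sketch, assuming the timings are simply copied into a dictionary; the names `EPOCH_TIME_S` and `amp_speedup` are illustrative and not part of `run_tf_glue.py`:

```python
# Second-epoch times in seconds, copied from the benchmark table above.
EPOCH_TIME_S = {
    "Titan V": {"FP32": 41, "AMP": 26},
    "V100": {"FP32": 35, "AMP": 22},
}


def amp_speedup(times):
    """Return the FP32/AMP time ratio per GPU (hypothetical helper)."""
    return {gpu: t["FP32"] / t["AMP"] for gpu, t in times.items()}


if __name__ == "__main__":
    for gpu, ratio in amp_speedup(EPOCH_TIME_S).items():
        print(f"{gpu}: AMP epoch is {ratio:.2f}x faster than FP32")
```

On these figures AMP is about 1.6x faster on both GPUs. The 1080 Ti is left out because the table has no AMP timing for it.
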
## Language model fine-tuning
@@ -77,7 +102,7 @@ python run_lm_finetuning.py \

Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).

Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet.
Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
can try out the different models available in the library.

@@ -387,7 +412,7 @@ f1 = 93.15
exact_match = 86.91
```

This fine-tuneds model is available as a checkpoint under the reference
This fine-tuned model is available as a checkpoint under the reference
`bert-large-uncased-whole-word-masking-finetuned-squad`.

#### Fine-tuning XLNet on SQuAD
@@ -427,3 +452,132 @@ Training with the previously defined hyper-parameters yields the following resul
  "HasAns_total": 10570
}
```

## Named Entity Recognition

Based on the script [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py).
This example fine-tunes Bert Multilingual on GermEval 2014 (German NER).
Details and results for the fine-tuning are provided by @stefan-it.

### Data (Download and pre-processing steps)

Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.

Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; the pre-processing step extracts only the two relevant columns (token and outer span NER annotation):

```bash
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv?attredirects=0&d=1' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attredirects=0&d=1' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
```

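For readers less familiar with the shell pipeline, the `grep | cut | tr` step can be mirrored in a few lines of Python. This is only an illustrative sketch: the function name and the sample rows are made up, and the shell commands above remain the reference.

```python
def extract_token_and_label(lines):
    """Python counterpart of the grep / cut / tr pipeline (hypothetical helper)."""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            continue  # grep -v "^#": drop comment lines
        if "\t" not in line:
            # Like cut, pass lines without a tab (e.g. the blank
            # sentence separators) through unchanged.
            yield line
            continue
        fields = line.split("\t")
        # cut -f 2,3 keeps columns 2 and 3; tr turns the tab into a space.
        yield " ".join(fields[1:3])


if __name__ == "__main__":
    # Made-up rows in the four-column layout described above.
    sample = [
        "#\thttp://example.org\t[2014-01-01]\n",
        "1\tSchartau\tB-PER\tO\n",
        "2\tsagte\tO\tO\n",
        "\n",
    ]
    for row in extract_token_and_label(sample):
        print(row)
```

Blank lines are kept on purpose, since they separate sentences in the resulting file.
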
The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`. One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s. I wrote a script that a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).

```bash
wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
```

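The two steps the script performs can be sketched as follows. This is not the code of `preprocess.py`: `filter_and_split` and `toy_tokenize` are hypothetical names, and the toy tokenizer only stands in for `BertTokenizer` so that the example runs without any model download.

```python
def toy_tokenize(token):
    """Stand-in for a subword tokenizer (assumption, not BertTokenizer).

    Non-printable characters are dropped, the rest is cut into
    pieces of at most four characters.
    """
    cleaned = "".join(ch for ch in token if ch.isprintable())
    return [cleaned[i:i + 4] for i in range(0, len(cleaned), 4)]


def filter_and_split(sentence, tokenize, max_subtokens):
    """Drop tokens with no subtokens and split the sentence into chunks.

    sentence: list of (token, label) pairs.
    Returns a list of chunks, each using at most max_subtokens subtokens
    (a single token longer than the limit still gets its own chunk).
    """
    chunks, current, used = [], [], 0
    for token, label in sentence:
        n = len(tokenize(token))
        if n == 0:
            continue  # a) the tokenizer returns nothing: filter the token
        if current and used + n > max_subtokens:
            chunks.append(current)  # b) limit reached: start a new sentence
            current, used = [], 0
        current.append((token, label))
        used += n
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sentence = [("Berlin", "B-LOC"), ("\x96", "O"), ("ist", "O"), ("schön", "O")]
    for chunk in filter_and_split(sentence, toy_tokenize, max_subtokens=3):
        print(chunk)
```

The sketch ignores the special tokens a real model adds to each sequence, so a real implementation may need a slightly smaller limit.
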
Let's define some variables that we need for further pre-processing steps and training the model:

```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
```

Run the pre-processing script on training, dev and test datasets:

```bash
python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```

The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so a dedicated set of labels must be used:

```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
```

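The label extraction above amounts to collecting the distinct values of the second column. A minimal Python sketch of the same idea, assuming the space-separated `token label` format produced earlier (`collect_labels` is an illustrative name, and the sample lines are made up):

```python
def collect_labels(lines):
    """Return the sorted set of labels found in `token label` lines."""
    labels = set()
    for raw in lines:
        parts = raw.rstrip("\n").split(" ")
        # Blank lines separate sentences and carry no label.
        if len(parts) >= 2 and parts[1]:
            labels.add(parts[1])
    return sorted(labels)


if __name__ == "__main__":
    sample = ["Berlin B-LOC\n", "ist O\n", "\n", "Merkel B-PER\n", "Berlin B-LOC\n"]
    print("\n".join(collect_labels(sample)))
```

Python sorts by code point, whereas `sort` follows the current locale, so the order of the lines in `labels.txt` may differ between the two approaches.
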
### Training

Additional environment variables must be set:

```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```

To start training, just run:

```bash
python3 run_ner.py --data_dir ./ \
--model_type bert \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
```

If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be evaluated on both the development and test datasets.

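When the run is launched from Python rather than from the shell, the same command line can be assembled as a list, with `--fp16` appended only when wanted. This is a sketch under the assumption that the settings shown above are kept in a plain dictionary; `build_ner_command` and `SETTINGS` are hypothetical names, not part of `run_ner.py`.

```python
import shlex

# Values copied from the export statements above.
SETTINGS = {
    "MAX_LENGTH": 128,
    "BERT_MODEL": "bert-base-multilingual-cased",
    "OUTPUT_DIR": "germeval-model",
    "BATCH_SIZE": 32,
    "NUM_EPOCHS": 3,
    "SAVE_STEPS": 750,
    "SEED": 1,
}


def build_ner_command(settings, fp16=False):
    """Assemble the run_ner.py argument list (hypothetical helper)."""
    command = [
        "python3", "run_ner.py",
        "--data_dir", "./",
        "--model_type", "bert",
        "--labels", "./labels.txt",
        "--model_name_or_path", settings["BERT_MODEL"],
        "--output_dir", settings["OUTPUT_DIR"],
        "--max_seq_length", str(settings["MAX_LENGTH"]),
        "--num_train_epochs", str(settings["NUM_EPOCHS"]),
        "--per_gpu_train_batch_size", str(settings["BATCH_SIZE"]),
        "--save_steps", str(settings["SAVE_STEPS"]),
        "--seed", str(settings["SEED"]),
        "--do_train", "--do_eval", "--do_predict",
    ]
    if fp16:
        command.append("--fp16")  # only for GPUs with half-precision support
    return command


if __name__ == "__main__":
    print(" ".join(shlex.quote(part) for part in build_ner_command(SETTINGS, fp16=True)))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids quoting problems.
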
### Evaluation

Evaluation on the development dataset outputs the following for our example:

```bash
10/04/2019 00:42:06 - INFO - __main__ -   ***** Eval results  *****
10/04/2019 00:42:06 - INFO - __main__ -     f1 = 0.8623348017621146
10/04/2019 00:42:06 - INFO - __main__ -     loss = 0.07183869666975543
10/04/2019 00:42:06 - INFO - __main__ -     precision = 0.8467916366258111
10/04/2019 00:42:06 - INFO - __main__ -     recall = 0.8784592370979806
```

On the test dataset, the following results were achieved:

```bash
10/04/2019 00:42:42 - INFO - __main__ -   ***** Eval results  *****
10/04/2019 00:42:42 - INFO - __main__ -     f1 = 0.8614389652384803
10/04/2019 00:42:42 - INFO - __main__ -     loss = 0.07064602487454782
10/04/2019 00:42:42 - INFO - __main__ -     precision = 0.8604651162790697
10/04/2019 00:42:42 - INFO - __main__ -     recall = 0.8624150210424085
```

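The reported F1 scores are the harmonic mean of precision and recall, which gives a quick way to sanity-check a log. A small sketch using the figures printed above (the dictionary layout and the function name are illustrative):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures copied from the evaluation logs above.
RESULTS = {
    "dev": {"precision": 0.8467916366258111, "recall": 0.8784592370979806,
            "f1": 0.8623348017621146},
    "test": {"precision": 0.8604651162790697, "recall": 0.8624150210424085,
             "f1": 0.8614389652384803},
}

if __name__ == "__main__":
    for split, scores in RESULTS.items():
        recomputed = f1_from(scores["precision"], scores["recall"])
        print(f"{split}: reported {scores['f1']:.4f}, recomputed {recomputed:.4f}")
```
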
## Abstractive summarization

Based on the script
[`run_summarization_finetuning.py`](https://github.com/huggingface/transformers/blob/master/examples/run_summarization_finetuning.py).

Before running this script you should download **both** CNN and Daily Mail
datasets from [Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the
links next to "Stories") into the same folder. Then uncompress the archives by running:

```bash
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
```

Note that the fine-tuning script **will not work** if you do not download both
datasets. We will refer to the path where you uncompressed both archives as
`$DATA_PATH`.

```bash
export DATA_PATH=/path/to/dataset/

python run_summarization_finetuning.py \
    --output_dir=output \
    --model_type=bert2bert \
    --model_name_or_path=bert2bert \
    --do_train \
    --data_path=$DATA_PATH
```

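Since the script needs both datasets, a small pre-flight check can save a failed run. The sketch below assumes that the two archives unpack into sub-directories named `cnn` and `dailymail`; those names are an assumption, so adjust `expected` to whatever the archives actually create. `missing_datasets` is a hypothetical helper, not part of the fine-tuning script.

```python
import os
import tempfile


def missing_datasets(data_path, expected=("cnn", "dailymail")):
    """Return the expected sub-directories that are absent from data_path.

    The default names in `expected` are an assumption about the archive layout.
    """
    return [name for name in expected
            if not os.path.isdir(os.path.join(data_path, name))]


if __name__ == "__main__":
    # Demonstration on a temporary folder holding only one of the two datasets.
    with tempfile.TemporaryDirectory() as data_path:
        os.mkdir(os.path.join(data_path, "cnn"))
        missing = missing_datasets(data_path)
        if missing:
            print("Missing datasets:", ", ".join(missing))
```
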
477
examples/benchmarks.py
Normal file
@@ -0,0 +1,477 @@
|
||||
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Benchmarking the library on inference and training """

# If checking the tensors placement
# tf.debugging.set_log_device_placement(True)

from typing import List
import timeit
from transformers import is_tf_available, is_torch_available
from time import time
import argparse
import csv

if is_tf_available():
    import tensorflow as tf
    from transformers import TFAutoModel

if is_torch_available():
    import torch
    from transformers import AutoModel

from transformers import AutoConfig, AutoTokenizer

input_text = """Bent over their instruments, three hundred Fertilizers were plunged, as
|
||||
the Director of Hatcheries and Conditioning entered the room, in the
|
||||
|
||||
|
||||
|
||||
scarcely breathing silence, the absent-minded, soliloquizing hum or
|
||||
whistle, of absorbed concentration. A troop of newly arrived students,
|
||||
very young, pink and callow, followed nervously, rather abjectly, at the
|
||||
Director's heels. Each of them carried a notebook, in which, whenever
|
||||
the great man spoke, he desperately scribbled. Straight from the
|
||||
horse's mouth. It was a rare privilege. The D. H. C. for Central London
|
||||
always made a point of personally conducting his new students round
|
||||
the various departments.
|
||||
|
||||
"Just to give you a general idea," he would explain to them. For of
|
||||
course some sort of general idea they must have, if they were to do
|
||||
their work intelligently-though as little of one, if they were to be good
|
||||
and happy members of society, as possible. For particulars, as every
|
||||
one knows, make for virtue and happiness; generalities are intellectu-
|
||||
ally necessary evils. Not philosophers but fret-sawyers and stamp col-
|
||||
lectors compose the backbone of society.
|
||||
|
||||
"To-morrow," he would add, smiling at them with a slightly menacing
|
||||
geniality, "you'll be settling down to serious work. You won't have time
|
||||
for generalities. Meanwhile ..."
|
||||
|
||||
Meanwhile, it was a privilege. Straight from the horse's mouth into the
|
||||
notebook. The boys scribbled like mad.
|
||||
|
||||
Tall and rather thin but upright, the Director advanced into the room.
|
||||
He had a long chin and big rather prominent teeth, just covered, when
|
||||
he was not talking, by his full, floridly curved lips. Old, young? Thirty?
|
||||
Fifty? Fifty-five? It was hard to say. And anyhow the question didn't
|
||||
arise; in this year of stability, A. F. 632, it didn't occur to you to ask it.
|
||||
|
||||
"I shall begin at the beginning," said the D.H.C. and the more zealous
|
||||
students recorded his intention in their notebooks: Begin at the begin-
|
||||
ning. "These," he waved his hand, "are the incubators." And opening
|
||||
an insulated door he showed them racks upon racks of numbered test-
|
||||
tubes. "The week's supply of ova. Kept," he explained, "at blood heat;
|
||||
whereas the male gametes," and here he opened another door, "they
|
||||
have to be kept at thirty-five instead of thirty-seven. Full blood heat
|
||||
sterilizes." Rams wrapped in theremogene beget no lambs.
|
||||
|
||||
Still leaning against the incubators he gave them, while the pencils
|
||||
scurried illegibly across the pages, a brief description of the modern
|
||||
|
||||
|
||||
|
||||
fertilizing process; spoke first, of course, of its surgical introduc-
|
||||
tion-"the operation undergone voluntarily for the good of Society, not
|
||||
to mention the fact that it carries a bonus amounting to six months'
|
||||
salary"; continued with some account of the technique for preserving
|
||||
the excised ovary alive and actively developing; passed on to a consid-
|
||||
eration of optimum temperature, salinity, viscosity; referred to the liq-
|
||||
uor in which the detached and ripened eggs were kept; and, leading
|
||||
his charges to the work tables, actually showed them how this liquor
|
||||
was drawn off from the test-tubes; how it was let out drop by drop
|
||||
onto the specially warmed slides of the microscopes; how the eggs
|
||||
which it contained were inspected for abnormalities, counted and
|
||||
transferred to a porous receptacle; how (and he now took them to
|
||||
watch the operation) this receptacle was immersed in a warm bouillon
|
||||
containing free-swimming spermatozoa-at a minimum concentration
|
||||
of one hundred thousand per cubic centimetre, he insisted; and how,
|
||||
after ten minutes, the container was lifted out of the liquor and its
|
||||
contents re-examined; how, if any of the eggs remained unfertilized, it
|
||||
was again immersed, and, if necessary, yet again; how the fertilized
|
||||
ova went back to the incubators; where the Alphas and Betas re-
|
||||
mained until definitely bottled; while the Gammas, Deltas and Epsilons
|
||||
were brought out again, after only thirty-six hours, to undergo Bo-
|
||||
kanovsky's Process.
|
||||
|
||||
"Bokanovsky's Process," repeated the Director, and the students un-
|
||||
derlined the words in their little notebooks.
|
||||
|
||||
One egg, one embryo, one adult-normality. But a bokanovskified egg
|
||||
will bud, will proliferate, will divide. From eight to ninety-six buds, and
|
||||
every bud will grow into a perfectly formed embryo, and every embryo
|
||||
into a full-sized adult. Making ninety-six human beings grow where
|
||||
only one grew before. Progress.
|
||||
|
||||
"Essentially," the D.H.C. concluded, "bokanovskification consists of a
|
||||
series of arrests of development. We check the normal growth and,
|
||||
paradoxically enough, the egg responds by budding."
|
||||
|
||||
Responds by budding. The pencils were busy.
|
||||
|
||||
He pointed. On a very slowly moving band a rack-full of test-tubes was
|
||||
entering a large metal box, another, rack-full was emerging. Machinery
|
||||
faintly purred. It took eight minutes for the tubes to go through, he
|
||||
|
||||
|
||||
|
||||
told them. Eight minutes of hard X-rays being about as much as an
|
||||
egg can stand. A few died; of the rest, the least susceptible divided
|
||||
into two; most put out four buds; some eight; all were returned to the
|
||||
incubators, where the buds began to develop; then, after two days,
|
||||
were suddenly chilled, chilled and checked. Two, four, eight, the buds
|
||||
in their turn budded; and having budded were dosed almost to death
|
||||
with alcohol; consequently burgeoned again and having budded-bud
|
||||
out of bud out of bud-were thereafter-further arrest being generally
|
||||
fatal-left to develop in peace. By which time the original egg was in a
|
||||
fair way to becoming anything from eight to ninety-six embryos- a
|
||||
prodigious improvement, you will agree, on nature. Identical twins-but
|
||||
not in piddling twos and threes as in the old viviparous days, when an
|
||||
egg would sometimes accidentally divide; actually by dozens, by
|
||||
scores at a time.
|
||||
|
||||
"Scores," the Director repeated and flung out his arms, as though he
|
||||
were distributing largesse. "Scores."
|
||||
|
||||
But one of the students was fool enough to ask where the advantage
|
||||
lay.
|
||||
|
||||
"My good boy!" The Director wheeled sharply round on him. "Can't you
|
||||
see? Can't you see?" He raised a hand; his expression was solemn.
|
||||
"Bokanovsky's Process is one of the major instruments of social stabil-
|
||||
ity!"
|
||||
|
||||
Major instruments of social stability.
|
||||
|
||||
Standard men and women; in uniform batches. The whole of a small
|
||||
factory staffed with the products of a single bokanovskified egg.
|
||||
|
||||
"Ninety-six identical twins working ninety-six identical machines!" The
|
||||
voice was almost tremulous with enthusiasm. "You really know where
|
||||
you are. For the first time in history." He quoted the planetary motto.
|
||||
"Community, Identity, Stability." Grand words. "If we could bo-
|
||||
kanovskify indefinitely the whole problem would be solved."
|
||||
|
||||
Solved by standard Gammas, unvarying Deltas, uniform Epsilons. Mil-
|
||||
lions of identical twins. The principle of mass production at last applied
|
||||
to biology.
|
||||
|
||||
|
||||
|
||||
"But, alas," the Director shook his head, "we can't bokanovskify indefi-
|
||||
nitely."
|
||||
|
||||
Ninety-six seemed to be the limit; seventy-two a good average. From
|
||||
the same ovary and with gametes of the same male to manufacture as
|
||||
many batches of identical twins as possible-that was the best (sadly a
|
||||
second best) that they could do. And even that was difficult.
|
||||
|
||||
"For in nature it takes thirty years for two hundred eggs to reach ma-
|
||||
turity. But our business is to stabilize the population at this moment,
|
||||
here and now. Dribbling out twins over a quarter of a century-what
|
||||
would be the use of that?"
|
||||
|
||||
Obviously, no use at all. But Podsnap's Technique had immensely ac-
|
||||
celerated the process of ripening. They could make sure of at least a
|
||||
hundred and fifty mature eggs within two years. Fertilize and bo-
|
||||
kanovskify-in other words, multiply by seventy-two-and you get an
|
||||
average of nearly eleven thousand brothers and sisters in a hundred
|
||||
and fifty batches of identical twins, all within two years of the same
|
||||
age.
|
||||
|
||||
"And in exceptional cases we can make one ovary yield us over fifteen
|
||||
thousand adult individuals."
|
||||
|
||||
Beckoning to a fair-haired, ruddy young man who happened to be
|
||||
passing at the moment. "Mr. Foster," he called. The ruddy young man
|
||||
approached. "Can you tell us the record for a single ovary, Mr. Foster?"
|
||||
|
||||
"Sixteen thousand and twelve in this Centre," Mr. Foster replied with-
|
||||
out hesitation. He spoke very quickly, had a vivacious blue eye, and
|
||||
took an evident pleasure in quoting figures. "Sixteen thousand and
|
||||
twelve; in one hundred and eighty-nine batches of identicals. But of
|
||||
course they've done much better," he rattled on, "in some of the tropi-
|
||||
cal Centres. Singapore has often produced over sixteen thousand five
|
||||
hundred; and Mombasa has actually touched the seventeen thousand
|
||||
mark. But then they have unfair advantages. You should see the way a
|
||||
negro ovary responds to pituitary! It's quite astonishing, when you're
|
||||
used to working with European material. Still," he added, with a laugh
|
||||
(but the light of combat was in his eyes and the lift of his chin was
|
||||
challenging), "still, we mean to beat them if we can. I'm working on a
|
||||
wonderful Delta-Minus ovary at this moment. Only just eighteen
|
||||
|
||||
|
||||
|
||||
months old. Over twelve thousand seven hundred children already, ei-
|
||||
ther decanted or in embryo. And still going strong. We'll beat them
|
||||
yet."
|
||||
|
||||
"That's the spirit I like!" cried the Director, and clapped Mr. Foster on
|
||||
the shoulder. "Come along with us, and give these boys the benefit of
|
||||
your expert knowledge."
|
||||
|
||||
Mr. Foster smiled modestly. "With pleasure." They went.
|
||||
In the Bottling Room all was harmonious bustle and ordered activity.
|
||||
Flaps of fresh sow's peritoneum ready cut to the proper size came
|
||||
shooting up in little lifts from the Organ Store in the sub-basement.
|
||||
Whizz and then, click! the lift-hatches hew open; the bottle-liner had
|
||||
only to reach out a hand, take the flap, insert, smooth-down, and be-
|
||||
fore the lined bottle had had time to travel out of reach along the end-
|
||||
less band, whizz, click! another flap of peritoneum had shot up from
|
||||
the depths, ready to be slipped into yet another bottle, the next of that
|
||||
slow interminable procession on the band.
|
||||
|
||||
Next to the Liners stood the Matriculators. The procession advanced;
|
||||
one by one the eggs were transferred from their test-tubes to the
|
||||
larger containers; deftly the peritoneal lining was slit, the morula
|
||||
dropped into place, the saline solution poured in ... and already the
|
||||
bottle had passed, and it was the turn of the labellers. Heredity, date
|
||||
of fertilization, membership of Bokanovsky Group-details were trans-
|
||||
ferred from test-tube to bottle. No longer anonymous, but named,
|
||||
identified, the procession marched slowly on; on through an opening in
|
||||
the wall, slowly on into the Social Predestination Room.
|
||||
"Eighty-eight cubic metres of card-index," said Mr. Foster with relish,
|
||||
as they entered."""
|
||||
|
||||
|
||||
def create_setup_and_compute(model_names: List[str],
                             gpu: bool = True,
                             tensorflow: bool = False,
                             average_over: int = 3,
                             torchscript: bool = False,
                             xla: bool = False,
                             amp: bool = False,
                             fp16: bool = False,
                             save_to_csv: bool = False,
                             csv_filename: str = f"results_{round(time())}.csv"):
    if xla:
        tf.config.optimizer.set_jit(True)
    if amp:
        tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

    if tensorflow:
        dictionary = {model_name: {} for model_name in model_names}
        results = _compute_tensorflow(model_names, dictionary, average_over, amp)
    else:
        device = 'cuda' if (gpu and torch.cuda.is_available()) else 'cpu'
        dictionary = {model_name: {} for model_name in model_names}
        results = _compute_pytorch(model_names, dictionary, average_over, device, torchscript, fp16)

    print("=========== RESULTS ===========")
    for model_name in model_names:
        print("\t" + f"======= MODEL CHECKPOINT: {model_name} =======")
        for batch_size in results[model_name]["bs"]:
            print("\t\t" + f"===== BATCH SIZE: {batch_size} =====")
            for slice_size in results[model_name]["ss"]:
                result = results[model_name]['results'][batch_size][slice_size]
                if isinstance(result, str):
                    print(f"\t\t{model_name}/{batch_size}/{slice_size}: "
                          f"{result}")
                else:
                    print(f"\t\t{model_name}/{batch_size}/{slice_size}: "
                          f"{(round(1000 * result) / 1000)}"
                          f"s")

    if save_to_csv:
        with open(csv_filename, mode='w') as csv_file:
            fieldnames = ['model',
                          '1x8', '1x64', '1x128', '1x256', '1x512', '1x1024',
                          '2x8', '2x64', '2x128', '2x256', '2x512', '2x1024',
                          '4x8', '4x64', '4x128', '4x256', '4x512', '4x1024',
                          '8x8', '8x64', '8x128', '8x256', '8x512', '8x1024',
                          ]

            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            writer.writeheader()

            for model_name in model_names:
                model_results = {
                    f'{bs}x{ss}': results[model_name]['results'][bs][ss]
                    for bs in results[model_name]["results"]
                    for ss in results[model_name]['results'][bs]
                }
                writer.writerow({'model': model_name, **model_results})


def _compute_pytorch(model_names, dictionary, average_over, device, torchscript, fp16):
    for c, model_name in enumerate(model_names):
        print(f"{c + 1} / {len(model_names)}")
        config = AutoConfig.from_pretrained(model_name, torchscript=torchscript)
        model = AutoModel.from_pretrained(model_name, config=config)
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)

        max_input_size = tokenizer.max_model_input_sizes[model_name]
        batch_sizes = [1, 2, 4, 8]
        slice_sizes = [8, 64, 128, 256, 512, 1024]

        dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "results": {}}
        dictionary[model_name]["results"] = {i: {} for i in batch_sizes}

        for batch_size in batch_sizes:
            if fp16:
                model.half()
            model.to(device)
            model.eval()
            for slice_size in slice_sizes:
                if max_input_size is not None and slice_size > max_input_size:
                    dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
                else:
                    sequence = torch.tensor(tokenized_sequence[:slice_size], device=device).repeat(batch_size, 1)
                    try:
                        if torchscript:
                            print("Tracing model with sequence size", sequence.shape)
                            inference = torch.jit.trace(model, sequence)
                            inference(sequence)
                        else:
                            inference = model
                            inference(sequence)

                        print("Going through model with sequence of shape", sequence.shape)
                        runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
                        average_time = sum(runtimes)/float(len(runtimes)) / 3.0
                        dictionary[model_name]["results"][batch_size][slice_size] = average_time
                    except RuntimeError as e:
                        print("Doesn't fit on GPU.", e)
                        torch.cuda.empty_cache()
                        dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
    return dictionary


def _compute_tensorflow(model_names, dictionary, average_over, amp):
    for c, model_name in enumerate(model_names):
        print(f"{c + 1} / {len(model_names)}")
        config = AutoConfig.from_pretrained(model_name)
        model = TFAutoModel.from_pretrained(model_name, config=config)
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)

        max_input_size = tokenizer.max_model_input_sizes[model_name]
        batch_sizes = [1, 2, 4, 8]
        slice_sizes = [8, 64, 128, 256, 512, 1024]

        dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "results": {}}
        dictionary[model_name]["results"] = {i: {} for i in batch_sizes}

        print("Using model", model)

        @tf.function
        def inference(inputs):
            return model(inputs)

        for batch_size in batch_sizes:
            for slice_size in slice_sizes:
                if max_input_size is not None and slice_size > max_input_size:
                    dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
                else:
                    sequence = tf.stack([tf.squeeze(tf.constant(tokenized_sequence[:slice_size])[None, :])] * batch_size)

                    try:
                        print("Going through model with sequence of shape", sequence.shape)
                        # To make sure that the model is traced + that the tensors are on the appropriate device
                        inference(sequence)

                        runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
                        average_time = sum(runtimes)/float(len(runtimes)) / 3.0
                        dictionary[model_name]["results"][batch_size][slice_size] = average_time
                    except tf.errors.ResourceExhaustedError as e:
                        print("Doesn't fit on GPU.", e)
                        dictionary[model_name]["results"][batch_size][slice_size] = "N/A"
    return dictionary


def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
parser.add_argument("--models", required=False, type=str, default='all', help="Model checkpoints to be provided "
|
||||
"to the AutoModel classes. Leave "
|
||||
"blank to benchmark the base version "
|
||||
"of all available model "
|
||||
"architectures.")
|
||||
parser.add_argument("--torch", required=False, action="store_true", help="Benchmark the Pytorch version of the "
|
||||
"models")
|
||||
parser.add_argument("--torch_cuda", required=False, action="store_true", help="Pytorch only: run on available "
|
||||
"cuda devices")
|
||||
parser.add_argument("--torchscript", required=False, action="store_true", help="Pytorch only: trace the models "
|
||||
"using torchscript")
|
||||
parser.add_argument("--tensorflow", required=False, action="store_true", help="Benchmark the TensorFlow version "
|
||||
"of the models. Will run on GPU if "
|
||||
"the correct dependencies are "
|
||||
"installed")
|
||||
parser.add_argument("--xla", required=False, action="store_true", help="TensorFlow only: use XLA acceleration.")
|
||||
parser.add_argument("--amp", required=False, action="store_true", help="TensorFlow only: use automatic mixed precision acceleration.")
|
||||
parser.add_argument("--fp16", required=False, action="store_true", help="PyTorch only: use FP16 to accelerate inference.")
|
||||
parser.add_argument("--keras_predict", required=False, action="store_true", help="Whether to use model.predict "
|
||||
"instead of model() to do a "
|
||||
"forward pass.")
|
||||
parser.add_argument("--save_to_csv", required=False, action="store_true", help="Save to a CSV file.")
|
||||
parser.add_argument("--csv_filename", required=False, default=None, help="CSV filename used if saving results to csv.")
|
||||
parser.add_argument("--average_over", required=False, default=30, type=int, help="Times an experiment will be run.")
|
||||
|
||||
args = parser.parse_args()
|
||||
if args.models == 'all':
|
||||
args.models = [
|
||||
"gpt2",
|
||||
"bert-base-cased",
|
||||
"xlnet-base-cased",
|
||||
"xlm-mlm-en-2048",
|
||||
"transfo-xl-wt103",
|
||||
"openai-gpt",
|
||||
"distilbert-base-uncased",
|
||||
"distilgpt2",
|
||||
"roberta-base",
|
||||
"ctrl"
|
||||
]
|
||||
else:
|
||||
args.models = args.models.split()
|
||||
|
||||
print("Running with arguments", args)
|
||||
|
||||
if args.torch:
|
||||
if is_torch_available():
|
||||
create_setup_and_compute(
|
||||
model_names=args.models,
|
||||
tensorflow=False,
|
||||
gpu=args.torch_cuda,
|
||||
torchscript=args.torchscript,
|
||||
fp16=args.fp16,
|
||||
save_to_csv=args.save_to_csv,
|
||||
csv_filename=args.csv_filename,
|
||||
average_over=args.average_over
|
||||
)
|
||||
else:
|
||||
raise ImportError("Trying to run a PyTorch benchmark but PyTorch was not found in the environment.")
|
||||
|
||||
if args.tensorflow:
|
||||
if is_tf_available():
|
||||
create_setup_and_compute(
|
||||
model_names=args.models,
|
||||
tensorflow=True,
|
||||
xla=args.xla,
|
||||
amp=args.amp,
|
||||
save_to_csv=args.save_to_csv,
|
||||
csv_filename=args.csv_filename,
|
||||
average_over=args.average_over
|
||||
)
|
||||
else:
|
||||
raise ImportError("Trying to run a TensorFlow benchmark but TensorFlow was not found in the environment.")
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
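The `--models` handling above can be sketched in isolation. This is a minimal illustration using only `argparse`; `DEFAULT_MODELS` and `parse_models` are hypothetical names, and the default list is shortened for brevity.

```python
import argparse

# Hypothetical, shortened stand-in for the full default checkpoint list.
DEFAULT_MODELS = ["gpt2", "bert-base-cased"]


def parse_models(argv):
    """Return the list of checkpoint names selected by --models."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--models", required=False, type=str, default="all")
    args = parser.parse_args(argv)
    if args.models == "all":
        return list(DEFAULT_MODELS)
    # str.split() with no argument splits on any run of whitespace,
    # so several checkpoints are passed as one quoted string.
    return args.models.split()


print(parse_models([]))
print(parse_models(["--models", "distilgpt2 roberta-base"]))
```

Because the value is split on whitespace, several checkpoints have to be passed as a single quoted argument, for example `--models "distilgpt2 roberta-base"`.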
|
||||
|
||||
@@ -1,25 +1,40 @@
|
||||
# Distil*
|
||||
|
||||
This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT and DistilGPT2.
|
||||
This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT, DistilRoBERTa and DistilGPT2.
|
||||
|
||||
**2019, October 3rd - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2. **The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance. Please use the paper as a reference when comparing/reporting results on DistilBERT.**
|
||||
**October 23rd, 2019 - Update** We release **DistilRoBERTa**: 95% of `RoBERTa-base`'s performance on GLUE, twice as fast as RoBERTa while being 35% smaller.
|
||||
|
||||
**October 3rd, 2019 - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2. **The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance. Please use the paper as a reference when comparing/reporting results on DistilBERT.**
|
||||
|
||||
**September 19th, 2019 - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
|
||||
|
||||
**2019, September 19th - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
|
||||
|
||||
## What is Distil*
|
||||
|
||||
Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting a Transformer model trained at large scale into production.
|
||||
|
||||
We have applied the same method to GPT2 and release the weights of the compressed model. On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 15.0 compared to 18.5 for DistilGPT2 (after fine-tuning on the train set).
|
||||
We have applied the same method to other Transformer architectures and released the weights:
|
||||
- GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 15.0 compared to 18.5 for **DistilGPT2** (after fine-tuning on the train set).
|
||||
- RoBERTa: **DistilRoBERTa** reaches 95% of `RoBERTa-base`'s performance on GLUE while being twice as fast and 35% smaller.
|
||||
- and more to come! 🤗🤗🤗
|
||||
|
||||
For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108).
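The teacher/student idea described above can be illustrated with a generic soft-target loss. This is a minimal sketch of textbook knowledge distillation, not the loss used by the training code in this folder: the function names and the temperature value are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


teacher_logits = [3.0, 1.0, 0.2]
print(soft_target_loss(teacher_logits, teacher_logits))   # student matches the teacher
print(soft_target_loss([0.5, 2.5, 0.1], teacher_logits))  # student disagrees with the teacher
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal the student is trained on.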
|
||||
|
||||
Here are the results on the dev sets of GLUE:
|
||||
|
||||
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2| STS-B| WNLI |
| :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:|
| BERT-base | **77.6** | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
| DistilBERT | **76.8** | 49.1 | 81.8 | 90.2 | 90.2 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
|
||||
| Model | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2| STS-B| WNLI |
| :---: | :---: | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---: |
| BERT-base | **77.6** | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
| DistilBERT | **76.8** | 49.1 | 81.8 | 90.2 | 90.2 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RoBERTa-base (reported) | **83.2**/**86.4**<sup>2</sup> | 63.6 | 87.6 | 90.2 | 92.8 | 91.9 | 78.7 | 94.8 | 91.2 | 57.7<sup>3</sup> |
| DistilRoBERTa<sup>1</sup> | **79.0**/**82.3**<sup>2</sup> | 59.4 | 83.9 | 86.6 | 90.8 | 89.4 | 67.9 | 92.5 | 88.3 | 52.1 |
|
||||
|
||||
<sup>1</sup> We did not use the MNLI checkpoint for fine-tuning but directly performed transfer learning on the pre-trained DistilRoBERTa.
|
||||
|
||||
<sup>2</sup> Macro-score computed without WNLI.
|
||||
|
||||
<sup>3</sup> We compute this score ourselves for completeness.
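The macro-scores in the table can be checked by hand. The sketch below assumes the macro-score is the unweighted mean of the per-task dev scores, which is consistent with the BERT-base and DistilBERT rows; `macro_score` is a hypothetical helper, and its `exclude` argument mirrors footnote 2.

```python
def macro_score(task_scores, exclude=()):
    """Unweighted mean of per-task scores, optionally leaving some tasks out."""
    kept = [score for task, score in task_scores.items() if task not in exclude]
    return sum(kept) / len(kept)


# Per-task dev scores copied from the table above.
bert_base = {"CoLA": 48.9, "MNLI": 84.3, "MRPC": 88.6, "QNLI": 89.3, "QQP": 89.5,
             "RTE": 71.3, "SST-2": 91.7, "STS-B": 91.2, "WNLI": 43.7}
distilbert = {"CoLA": 49.1, "MNLI": 81.8, "MRPC": 90.2, "QNLI": 90.2, "QQP": 89.2,
              "RTE": 62.9, "SST-2": 92.7, "STS-B": 90.7, "WNLI": 44.4}

print(round(macro_score(bert_base), 1))
print(round(macro_score(distilbert), 1))
print(round(macro_score(distilbert, exclude=("WNLI",)), 1))  # variant without WNLI
```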
|
||||
|
||||
## Setup
|
||||
|
||||
@@ -27,13 +42,15 @@ This part of the library has only be tested with Python3.6+. There are few speci
|
||||
|
||||
**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0).
|
||||
|
||||
|
||||
## How to use DistilBERT
|
||||
|
||||
Transformers includes several pre-trained Distil* models, currently only provided for English (we are investigating the possibility of training and releasing a multilingual version of DistilBERT):
|
||||
|
||||
- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain BERT (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of BERT. The model has 6 layers, 768 dimensions and 12 heads, totaling 66M parameters.
|
||||
- `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` fine-tuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.9 on the dev set (for comparison, the `bert-base-uncased` version of BERT reaches an F1 score of 88.5).
|
||||
- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset and . The model has 6 layers, 768 dimension and 12 heads, totalizing 82M (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
|
||||
- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 dimensions and 12 heads, totaling 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
|
||||
- `distilroberta-base`: DistilRoBERTa English language model pretrained with the supervision of `roberta-base` solely on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (about 4 times less training data than the teacher RoBERTa). The model has 6 layers, 768 dimensions and 12 heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base). On average, DistilRoBERTa is twice as fast as RoBERTa-base.
|
||||
- and more to come! 🤗🤗🤗
|
||||
|
||||
Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`, even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to keep naming consistent across the library models.
|
||||
@@ -47,7 +64,10 @@ outputs = model(input_ids)
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
```
|
||||
|
||||
Similarly, using DistilGPT2 simply consists in calling the GPT2 classes from a different pretrained checkpoint: `model = GPT2Model.from_pretrained('distilgpt2')`.
|
||||
Similarly, using the other Distil* models simply consists in calling the base classes with a different pretrained checkpoint:
|
||||
- DistilGPT2: `model = GPT2Model.from_pretrained('distilgpt2')`
|
||||
- DistilRoBERTa: `model = RobertaModel.from_pretrained('distilroberta-base')`
|
||||
|
||||
|
||||
## How to train Distil*
|
||||
|
||||
@@ -88,7 +108,7 @@ python train.py \
|
||||
--student_config training_configs/distilbert-base-uncased.json \
|
||||
--teacher_type bert \
|
||||
--teacher_name bert-base-uncased \
|
||||
--alpha_ce 5.0 --alpha_mlm 2.0 --alpha_cos 1.0 --mlm \
|
||||
--alpha_ce 5.0 --alpha_mlm 2.0 --alpha_cos 1.0 --alpha_clm 0.0 --mlm \
|
||||
--freeze_pos_embs \
|
||||
--dump_path serialization_dir/my_first_training \
|
||||
--data_file data/binarized_text.bert-base-uncased.pickle \
|
||||
@@ -124,7 +144,7 @@ python -m torch.distributed.launch \
|
||||
--student_config training_configs/distilbert-base-uncased.json \
|
||||
--teacher_type bert \
|
||||
--teacher_name bert-base-uncased \
|
||||
--alpha_ce 0.33 --alpha_mlm 0.33 --alpha_cos 0.33 --mlm \
|
||||
--alpha_ce 0.33 --alpha_mlm 0.33 --alpha_cos 0.33 --alpha_clm 0.0 --mlm \
|
||||
--freeze_pos_embs \
|
||||
--dump_path serialization_dir/my_first_training \
|
||||
--data_file data/binarized_text.bert-base-uncased.pickle \
|
||||
@@ -146,4 +166,4 @@ If you find the ressource useful, you should cite the following paper:
|
||||
booktitle={NeurIPS EMC^2 Workshop},
|
||||
year={2019}
|
||||
}
|
||||
```
|
||||
```
|
||||
|
||||
@@ -68,7 +68,7 @@ def main():
|
||||
start = time.time()
|
||||
for text in data:
|
||||
text = f'{bos} {text.strip()} {sep}'
|
||||
token_ids = tokenizer.encode(text)
|
||||
token_ids = tokenizer.encode(text, add_special_tokens=False)
|
||||
rslt.append(token_ids)
|
||||
|
||||
iter += 1
|
||||
|
||||
@@ -1,2 +1,4 @@
|
||||
tensorboardX
|
||||
scikit-learn
|
||||
tensorboard
|
||||
scikit-learn
|
||||
seqeval
|
||||
|
||||
@@ -79,13 +79,12 @@ def set_seed(args):
|
||||
def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
|
||||
""" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
|
||||
Args:
|
||||
logits: logits distribution shape (vocabulary size)
|
||||
logits: logits distribution shape (batch size x vocabulary size)
|
||||
top_k > 0: keep only top k tokens with highest probability (top-k filtering).
|
||||
top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
|
||||
Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
|
||||
From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
|
||||
"""
|
||||
assert logits.dim() == 1 # batch size 1 for now - could be updated for more but the code would be less clear
|
||||
top_k = min(top_k, logits.size(-1)) # Safety check
|
||||
if top_k > 0:
|
||||
# Remove all tokens with a probability less than the last token of the top-k
|
||||
@@ -102,7 +101,8 @@ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')
|
||||
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
|
||||
sorted_indices_to_remove[..., 0] = 0
|
||||
|
||||
indices_to_remove = sorted_indices[sorted_indices_to_remove]
|
||||
# scatter sorted tensors to original indexing
|
||||
indices_to_remove = sorted_indices_to_remove.scatter(dim=1, index=sorted_indices, src=sorted_indices_to_remove)
|
||||
logits[indices_to_remove] = filter_value
|
||||
return logits
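To make the two filters concrete, here is a dependency-free sketch of top-k and nucleus (top-p) filtering over a single list of logits. It mirrors the logic described in the docstring above but is not the library code: `top_k_top_p_filter` is a hypothetical name, and it works on one Python list rather than a batched tensor.

```python
import math


def top_k_top_p_filter(logits, top_k=0, top_p=0.0, filter_value=-math.inf):
    """Return a filtered copy of a 1-D list of logits."""
    logits = list(logits)
    top_k = min(top_k, len(logits))  # safety check
    if top_k > 0:
        # Mask everything strictly below the k-th largest logit.
        kth_largest = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= kth_largest else filter_value for x in logits]
    if top_p > 0.0 and logits:
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        peak = logits[order[0]]
        weights = [math.exp(logits[i] - peak) for i in order]  # exp(-inf) is 0.0
        total = sum(weights)
        cumulative = 0.0
        remove = []
        for weight in weights:
            cumulative += weight / total
            remove.append(cumulative > top_p)
        # Shift right so the first token that crosses the threshold is kept.
        remove = [False] + remove[:-1]
        for index, drop in zip(order, remove):
            if drop:
                logits[index] = filter_value
    return logits


example = [2.0, 1.0, 0.5, -1.0]
print(top_k_top_p_filter(example, top_k=2))
print(top_k_top_p_filter(example, top_p=0.9))
```

Masked positions are set to negative infinity so that a subsequent softmax assigns them zero probability.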
|
||||
|
||||
@@ -136,18 +136,19 @@ def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=
|
||||
inputs["langs"] = torch.tensor([xlm_lang] * inputs["input_ids"].shape[1], device=device).view(1, -1)
|
||||
|
||||
outputs = model(**inputs) # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet/CTRL (cached hidden-states)
|
||||
next_token_logits = outputs[0][0, -1, :] / (temperature if temperature > 0 else 1.)
|
||||
next_token_logits = outputs[0][:, -1, :] / (temperature if temperature > 0 else 1.)
|
||||
|
||||
# reptition penalty from CTRL (https://arxiv.org/abs/1909.05858)
|
||||
for _ in set(generated):
|
||||
next_token_logits[_] /= repetition_penalty
|
||||
# repetition penalty from CTRL (https://arxiv.org/abs/1909.05858)
|
||||
for i in range(num_samples):
|
||||
for _ in set(generated[i].tolist()):
|
||||
next_token_logits[i, _] /= repetition_penalty
|
||||
|
||||
filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
|
||||
if temperature == 0: #greedy sampling:
|
||||
next_token = torch.argmax(filtered_logits).unsqueeze(0)
|
||||
if temperature == 0: # greedy sampling:
|
||||
next_token = torch.argmax(filtered_logits, dim=-1).unsqueeze(-1)
|
||||
else:
|
||||
next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
|
||||
generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
|
||||
generated = torch.cat((generated, next_token), dim=1)
|
||||
return generated
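The repetition penalty and the greedy branch can be traced on a toy example. This is a plain-Python sketch for one sequence with hypothetical helper names; the script itself operates on batched tensors.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Divide the logit of every already-generated token by the penalty, once per token."""
    logits = list(logits)
    for token_id in set(generated_ids):
        logits[token_id] /= penalty
    return logits


def greedy_pick(logits):
    """Index of the largest logit (the temperature == 0 case)."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [4.0, 3.0, 1.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 0], penalty=2.0)
print(greedy_pick(logits), greedy_pick(penalized))
```

Iterating over a set means a token that was generated several times is still penalized only once. Note that dividing a negative logit by a penalty greater than 1 moves it toward zero, so the division discourages repetition only for positive logits.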
|
||||
|
||||
|
||||
@@ -161,6 +162,7 @@ def main():
|
||||
parser.add_argument("--padding_text", type=str, default="")
|
||||
parser.add_argument("--xlm_lang", type=str, default="", help="Optional language when used with the XLM model.")
|
||||
parser.add_argument("--length", type=int, default=20)
|
||||
parser.add_argument("--num_samples", type=int, default=1)
|
||||
parser.add_argument("--temperature", type=float, default=1.0,
|
||||
help="temperature of 0 implies greedy sampling")
|
||||
parser.add_argument("--repetition_penalty", type=float, default=1.0,
|
||||
@@ -196,7 +198,7 @@ def main():
|
||||
|
||||
logger.info(args)
|
||||
if args.model_type in ["ctrl"]:
|
||||
if args.temperature > 0.7 :
|
||||
if args.temperature > 0.7:
|
||||
logger.info('CTRL typically works better with lower temperatures (and lower top_k).')
|
||||
|
||||
while True:
|
||||
@@ -223,10 +225,14 @@ def main():
|
||||
if args.model_type in ["transfo-xl", "xlnet"]:
|
||||
# Models with memory like to have a long prompt for short inputs.
|
||||
raw_text = (args.padding_text if args.padding_text else PADDING_TEXT) + raw_text
|
||||
context_tokens = tokenizer.encode(raw_text)
|
||||
context_tokens = tokenizer.encode(raw_text, add_special_tokens=False)
|
||||
if args.model_type == "ctrl":
|
||||
if not any(context_tokens[0] == x for x in tokenizer.control_codes.values()):
|
||||
logger.info("WARNING! You are not starting your generation from a control code so you won't get good results")
|
||||
out = sample_sequence(
|
||||
model=model,
|
||||
context=context_tokens,
|
||||
num_samples=args.num_samples,
|
||||
length=args.length,
|
||||
temperature=args.temperature,
|
||||
top_k=args.top_k,
|
||||
@@ -238,12 +244,13 @@ def main():
|
||||
xlm_lang=xlm_lang,
|
||||
device=args.device,
|
||||
)
|
||||
out = out[0, len(context_tokens):].tolist()
|
||||
out = out[:, len(context_tokens):].tolist()
|
||||
for o in out:
|
||||
text = tokenizer.decode(o, clean_up_tokenization_spaces=True)
|
||||
text = text[: text.find(args.stop_token) if args.stop_token else None]
|
||||
|
||||
text = tokenizer.decode(out, clean_up_tokenization_spaces=True, skip_special_tokens=True)
|
||||
text = text[: text.find(args.stop_token) if args.stop_token else None]
|
||||
print(text)
|
||||
|
||||
print(text)
|
||||
if args.prompt:
|
||||
break
|
||||
return text
|
||||
|
||||
@@ -154,13 +154,16 @@ def train(args, train_dataset, model, tokenizer):
|
||||
if args.fp16:
|
||||
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
||||
scaled_loss.backward()
|
||||
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||
else:
|
||||
loss.backward()
|
||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||
|
||||
tr_loss += loss.item()
|
||||
if (step + 1) % args.gradient_accumulation_steps == 0 and not args.tpu:
|
||||
if args.fp16:
|
||||
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||
else:
|
||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||
|
||||
optimizer.step()
|
||||
scheduler.step() # Update learning rate schedule
|
||||
model.zero_grad()
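The hunk above moves gradient clipping next to the optimizer step, so the accumulated gradient is clipped once per update instead of after every backward pass. A torch-free sketch of that ordering follows; `clip_by_global_norm` is a hypothetical helper, and the scaling rule (`max_norm / (norm + eps)`, capped at 1) is an assumption modeled on common norm-clipping implementations.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Two micro-batches are accumulated before a single optimizer step.
accumulated = [0.0, 0.0]
micro_batch_grads = [[1.0, 2.0], [2.0, 2.0]]
for grads in micro_batch_grads:
    accumulated = [a + g for a, g in zip(accumulated, grads)]

# Clip once, on the accumulated gradient, right before the update.
clipped, norm_before = clip_by_global_norm(accumulated, max_norm=1.0)
print(norm_before, clipped)
```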
|
||||
|
||||
@@ -309,10 +309,12 @@ def evaluate(args, model, tokenizer, prefix=""):
|
||||
model.eval()
|
||||
|
||||
for batch in tqdm(eval_dataloader, desc="Evaluating"):
|
||||
batch = batch.to(args.device)
|
||||
inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
|
||||
inputs = inputs.to(args.device)
|
||||
labels = labels.to(args.device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(batch, masked_lm_labels=batch) if args.mlm else model(batch, labels=batch)
|
||||
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
|
||||
lm_loss = outputs[0]
|
||||
eval_loss += lm_loss.mean().item()
|
||||
nb_eval_steps += 1
|
||||
|
||||
518
examples/run_ner.py
Normal file
@@ -0,0 +1,518 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
|
||||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Fine-tuning the library models for named entity recognition on CoNLL-2003 (Bert or Roberta). """
|
||||
|
||||
from __future__ import absolute_import, division, print_function
|
||||
|
||||
import argparse
|
||||
import glob
|
||||
import logging
|
||||
import os
|
||||
import random
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from seqeval.metrics import precision_score, recall_score, f1_score
|
||||
from tensorboardX import SummaryWriter
|
||||
from torch.nn import CrossEntropyLoss
|
||||
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
|
||||
from torch.utils.data.distributed import DistributedSampler
|
||||
from tqdm import tqdm, trange
|
||||
from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file
|
||||
|
||||
from transformers import AdamW, WarmupLinearSchedule
|
||||
from transformers import WEIGHTS_NAME, BertConfig, BertForTokenClassification, BertTokenizer
|
||||
from transformers import RobertaConfig, RobertaForTokenClassification, RobertaTokenizer
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
ALL_MODELS = sum(
|
||||
(tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, RobertaConfig)),
|
||||
())
|
||||
|
||||
MODEL_CLASSES = {
|
||||
"bert": (BertConfig, BertForTokenClassification, BertTokenizer),
|
||||
"roberta": (RobertaConfig, RobertaForTokenClassification, RobertaTokenizer)
|
||||
}
|
||||
|
||||
|
||||
def set_seed(args):
|
||||
random.seed(args.seed)
|
||||
np.random.seed(args.seed)
|
||||
torch.manual_seed(args.seed)
|
||||
if args.n_gpu > 0:
|
||||
torch.cuda.manual_seed_all(args.seed)
|
||||
|
||||
|
||||
def train(args, train_dataset, model, tokenizer, labels, pad_token_label_id):
|
||||
""" Train the model """
|
||||
if args.local_rank in [-1, 0]:
|
||||
tb_writer = SummaryWriter()
|
||||
|
||||
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
|
||||
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
|
||||
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
|
||||
|
||||
if args.max_steps > 0:
|
||||
t_total = args.max_steps
|
||||
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
|
||||
else:
|
||||
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
|
||||
|
||||
# Prepare optimizer and schedule (linear warmup and decay)
|
||||
no_decay = ["bias", "LayerNorm.weight"]
|
||||
optimizer_grouped_parameters = [
|
||||
{"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
|
||||
"weight_decay": args.weight_decay},
|
||||
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}
|
||||
]
|
||||
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
|
||||
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
|
||||
if args.fp16:
|
||||
try:
|
||||
from apex import amp
|
||||
except ImportError:
|
||||
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
|
||||
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
|
||||
|
||||
# multi-gpu training (should be after apex fp16 initialization)
|
||||
if args.n_gpu > 1:
|
||||
model = torch.nn.DataParallel(model)
|
||||
|
||||
# Distributed training (should be after apex fp16 initialization)
|
||||
if args.local_rank != -1:
|
||||
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
|
||||
output_device=args.local_rank,
|
||||
find_unused_parameters=True)
|
||||
|
||||
# Train!
|
||||
logger.info("***** Running training *****")
|
||||
logger.info(" Num examples = %d", len(train_dataset))
|
||||
logger.info(" Num Epochs = %d", args.num_train_epochs)
|
||||
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
|
||||
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
|
||||
args.train_batch_size * args.gradient_accumulation_steps * (
|
||||
torch.distributed.get_world_size() if args.local_rank != -1 else 1))
|
||||
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
|
||||
logger.info(" Total optimization steps = %d", t_total)
|
||||
|
||||
global_step = 0
|
||||
tr_loss, logging_loss = 0.0, 0.0
|
||||
model.zero_grad()
|
||||
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
|
||||
set_seed(args)  # Added here for reproducibility (even between python 2 and 3)
|
||||
for _ in train_iterator:
|
||||
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
|
||||
for step, batch in enumerate(epoch_iterator):
|
||||
model.train()
|
||||
batch = tuple(t.to(args.device) for t in batch)
|
||||
inputs = {"input_ids": batch[0],
|
||||
"attention_mask": batch[1],
|
||||
"token_type_ids": batch[2] if args.model_type in ["bert", "xlnet"] else None,
|
||||
# XLM and RoBERTa don't use segment_ids
|
||||
"labels": batch[3]}
|
||||
outputs = model(**inputs)
|
||||
loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
|
||||
|
||||
if args.n_gpu > 1:
|
||||
loss = loss.mean() # mean() to average on multi-gpu parallel training
|
||||
if args.gradient_accumulation_steps > 1:
|
||||
loss = loss / args.gradient_accumulation_steps
|
||||
|
||||
if args.fp16:
|
||||
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
||||
scaled_loss.backward()
|
||||
else:
|
||||
loss.backward()
|
||||
|
||||
tr_loss += loss.item()
|
||||
if (step + 1) % args.gradient_accumulation_steps == 0:
|
||||
if args.fp16:
|
||||
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||
else:
|
||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||
|
||||
optimizer.step()
scheduler.step()  # Update learning rate schedule
|
||||
model.zero_grad()
|
||||
global_step += 1
|
||||
|
||||
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
|
||||
# Log metrics
|
||||
if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
|
||||
results, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev")
|
||||
for key, value in results.items():
|
||||
tb_writer.add_scalar("eval_{}".format(key), value, global_step)
|
||||
tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
|
||||
tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
|
||||
logging_loss = tr_loss
|
||||
|
||||
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
|
||||
# Save model checkpoint
|
||||
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
|
||||
if not os.path.exists(output_dir):
|
||||
os.makedirs(output_dir)
|
||||
model_to_save = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
|
||||
model_to_save.save_pretrained(output_dir)
|
||||
torch.save(args, os.path.join(output_dir, "training_args.bin"))
|
||||
logger.info("Saving model checkpoint to %s", output_dir)
|
||||
|
||||
if args.max_steps > 0 and global_step > args.max_steps:
|
||||
epoch_iterator.close()
|
||||
break
|
||||
if args.max_steps > 0 and global_step > args.max_steps:
|
||||
train_iterator.close()
|
||||
break
|
||||
|
||||
if args.local_rank in [-1, 0]:
|
||||
tb_writer.close()
|
||||
|
||||
return global_step, tr_loss / global_step
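The step arithmetic and the "linear warmup and decay" schedule used by `train` can be worked through with small numbers. The sketch below is an assumption about the shape of such a schedule (a multiplier that rises linearly during warmup, then falls linearly to zero); both function names are hypothetical and the code is not copied from the library.

```python
def total_optimization_steps(num_batches, gradient_accumulation_steps, num_train_epochs):
    """Number of optimizer updates, as computed for t_total when max_steps is not set."""
    return num_batches // gradient_accumulation_steps * num_train_epochs


def warmup_linear_multiplier(step, warmup_steps, t_total):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (t_total - step) / max(1, t_total - warmup_steps))


t_total = total_optimization_steps(num_batches=100, gradient_accumulation_steps=4, num_train_epochs=3)
print(t_total)
for step in (0, 5, 10, 40, 75):
    print(step, warmup_linear_multiplier(step, warmup_steps=10, t_total=t_total))
```

With gradient accumulation, only every fourth batch triggers an update, which is why 100 batches per epoch yield 25 optimizer steps per epoch here.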
|
||||
|
||||
|
||||
def evaluate(args, model, tokenizer, labels, pad_token_label_id, mode, prefix=""):
|
||||
eval_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode=mode)
|
||||
|
||||
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
|
||||
# Note that DistributedSampler samples randomly
|
||||
eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
|
||||
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
|
||||
|
||||
# Eval!
|
||||
logger.info("***** Running evaluation %s *****", prefix)
|
||||
logger.info(" Num examples = %d", len(eval_dataset))
|
||||
logger.info(" Batch size = %d", args.eval_batch_size)
|
||||
eval_loss = 0.0
|
||||
nb_eval_steps = 0
|
||||
preds = None
|
||||
out_label_ids = None
|
||||
model.eval()
|
||||
for batch in tqdm(eval_dataloader, desc="Evaluating"):
|
||||
batch = tuple(t.to(args.device) for t in batch)
|
||||
|
||||
with torch.no_grad():
|
||||
inputs = {"input_ids": batch[0],
|
||||
"attention_mask": batch[1],
|
||||
"token_type_ids": batch[2] if args.model_type in ["bert", "xlnet"] else None,
|
||||
# XLM and RoBERTa don't use segment_ids
|
||||
"labels": batch[3]}
|
||||
outputs = model(**inputs)
|
||||
tmp_eval_loss, logits = outputs[:2]
|
||||
|
||||
if args.n_gpu > 1:
|
||||
tmp_eval_loss = tmp_eval_loss.mean() # mean() to average on multi-gpu parallel evaluating
|
||||
|
||||
eval_loss += tmp_eval_loss.item()
|
||||
nb_eval_steps += 1
|
||||
if preds is None:
|
||||
preds = logits.detach().cpu().numpy()
|
||||
out_label_ids = inputs["labels"].detach().cpu().numpy()
|
||||
else:
|
||||
preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
|
||||
out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
|
||||
|
||||
eval_loss = eval_loss / nb_eval_steps
|
||||
preds = np.argmax(preds, axis=2)
|
||||
|
||||
label_map = {i: label for i, label in enumerate(labels)}
|
||||
|
||||
out_label_list = [[] for _ in range(out_label_ids.shape[0])]
|
||||
preds_list = [[] for _ in range(out_label_ids.shape[0])]
|
||||
|
||||
for i in range(out_label_ids.shape[0]):
|
||||
for j in range(out_label_ids.shape[1]):
|
||||
if out_label_ids[i, j] != pad_token_label_id:
|
||||
out_label_list[i].append(label_map[out_label_ids[i][j]])
|
||||
preds_list[i].append(label_map[preds[i][j]])
|
||||
|
||||
results = {
|
||||
"loss": eval_loss,
|
||||
"precision": precision_score(out_label_list, preds_list),
|
||||
"recall": recall_score(out_label_list, preds_list),
|
||||
"f1": f1_score(out_label_list, preds_list)
|
||||
}
|
||||
|
||||
logger.info("***** Eval results %s *****", prefix)
|
||||
for key in sorted(results.keys()):
|
||||
logger.info(" %s = %s", key, str(results[key]))
|
||||
|
||||
return results, preds_list
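The alignment step at the end of `evaluate` drops every position whose gold label equals `pad_token_label_id` before scoring. A small sketch with plain lists follows; the value `-100` and the label set are assumptions chosen for illustration, and `align_predictions` is a hypothetical name.

```python
PAD_LABEL_ID = -100  # assumption: the id used to mark positions that should be ignored


def align_predictions(label_ids, pred_ids, labels, pad_token_label_id=PAD_LABEL_ID):
    """Map ids to label strings, skipping positions whose gold label is the pad id."""
    label_map = dict(enumerate(labels))
    gold_list, pred_list = [], []
    for gold_row, pred_row in zip(label_ids, pred_ids):
        kept = [(g, p) for g, p in zip(gold_row, pred_row) if g != pad_token_label_id]
        gold_list.append([label_map[g] for g, _ in kept])
        pred_list.append([label_map[p] for _, p in kept])
    return gold_list, pred_list


labels = ["O", "B-PER", "I-PER"]
label_ids = [[-100, 1, 2, -100], [0, -100, 0, 0]]
pred_ids = [[0, 1, 0, 0], [0, 2, 1, 0]]
gold, pred = align_predictions(label_ids, pred_ids, labels)
print(gold)
print(pred)
```

The resulting per-sentence lists of label strings are the input format expected by sequence-level metrics such as those imported from `seqeval`.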
|
||||
|
||||
|
||||
def load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode):
|
||||
if args.local_rank not in [-1, 0] and not evaluate:
|
||||
torch.distributed.barrier() # Make sure only the first process in distributed training process the dataset, and the others will use the cache
|
||||
|
||||
# Load data features from cache or dataset file
|
||||
cached_features_file = os.path.join(args.data_dir, "cached_{}_{}_{}".format(mode,
|
||||
list(filter(None, args.model_name_or_path.split("/"))).pop(),
|
||||
str(args.max_seq_length)))
|
||||
if os.path.exists(cached_features_file) and not args.overwrite_cache:
|
||||
logger.info("Loading features from cached file %s", cached_features_file)
|
||||
features = torch.load(cached_features_file)
|
||||
else:
|
||||
logger.info("Creating features from dataset file at %s", args.data_dir)
|
||||
examples = read_examples_from_file(args.data_dir, mode)
|
||||
features = convert_examples_to_features(examples, labels, args.max_seq_length, tokenizer,
|
||||
cls_token_at_end=bool(args.model_type in ["xlnet"]),
|
||||
# xlnet has a cls token at the end
|
||||
cls_token=tokenizer.cls_token,
|
||||
cls_token_segment_id=2 if args.model_type in ["xlnet"] else 0,
|
||||
sep_token=tokenizer.sep_token,
|
||||
sep_token_extra=bool(args.model_type in ["roberta"]),
|
||||
# roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
|
||||
pad_on_left=bool(args.model_type in ["xlnet"]),
|
||||
# pad on the left for xlnet
|
||||
pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
|
||||
pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
|
||||
pad_token_label_id=pad_token_label_id
|
||||
)
|
||||
if args.local_rank in [-1, 0]:
|
||||
logger.info("Saving features into cached file %s", cached_features_file)
|
||||
torch.save(features, cached_features_file)
|
||||
|
||||
if args.local_rank == 0 and not evaluate:
|
||||
torch.distributed.barrier() # Make sure only the first process in distributed training processes the dataset; the others will use the cache
|
||||
|
||||
# Convert to Tensors and build dataset
|
||||
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
|
||||
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
|
||||
all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
|
||||
all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)
|
||||
|
||||
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
|
||||
return dataset
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
## Required parameters
|
||||
parser.add_argument("--data_dir", default=None, type=str, required=True,
|
||||
help="The input data dir. Should contain the training files for the CoNLL-2003 NER task.")
|
||||
parser.add_argument("--model_type", default=None, type=str, required=True,
|
||||
help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
|
||||
parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
|
||||
help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
|
||||
parser.add_argument("--output_dir", default=None, type=str, required=True,
|
||||
help="The output directory where the model predictions and checkpoints will be written.")
|
||||
|
||||
## Other parameters
|
||||
parser.add_argument("--labels", default="", type=str,
|
||||
help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.")
|
||||
parser.add_argument("--config_name", default="", type=str,
|
||||
help="Pretrained config name or path if not the same as model_name")
|
||||
parser.add_argument("--tokenizer_name", default="", type=str,
|
||||
help="Pretrained tokenizer name or path if not the same as model_name")
|
||||
parser.add_argument("--cache_dir", default="", type=str,
|
||||
help="Where do you want to store the pre-trained models downloaded from s3")
|
||||
parser.add_argument("--max_seq_length", default=128, type=int,
|
||||
help="The maximum total input sequence length after tokenization. Sequences longer "
|
||||
"than this will be truncated, sequences shorter will be padded.")
|
||||
parser.add_argument("--do_train", action="store_true",
|
||||
help="Whether to run training.")
|
||||
parser.add_argument("--do_eval", action="store_true",
|
||||
help="Whether to run eval on the dev set.")
|
||||
parser.add_argument("--do_predict", action="store_true",
|
||||
help="Whether to run predictions on the test set.")
|
||||
parser.add_argument("--evaluate_during_training", action="store_true",
|
||||
help="Whether to run evaluation during training at each logging step.")
|
||||
parser.add_argument("--do_lower_case", action="store_true",
|
||||
help="Set this flag if you are using an uncased model.")
|
||||
|
||||
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
|
||||
help="Batch size per GPU/CPU for training.")
|
||||
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
|
||||
help="Batch size per GPU/CPU for evaluation.")
|
||||
parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
|
||||
help="Number of updates steps to accumulate before performing a backward/update pass.")
|
||||
parser.add_argument("--learning_rate", default=5e-5, type=float,
|
||||
help="The initial learning rate for Adam.")
|
||||
parser.add_argument("--weight_decay", default=0.0, type=float,
|
||||
help="Weight decay if we apply some.")
|
||||
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
|
||||
help="Epsilon for Adam optimizer.")
|
||||
parser.add_argument("--max_grad_norm", default=1.0, type=float,
|
||||
help="Max gradient norm.")
|
||||
parser.add_argument("--num_train_epochs", default=3.0, type=float,
|
||||
help="Total number of training epochs to perform.")
|
||||
parser.add_argument("--max_steps", default=-1, type=int,
|
||||
help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
|
||||
parser.add_argument("--warmup_steps", default=0, type=int,
|
||||
help="Linear warmup over warmup_steps.")
|
||||
|
||||
parser.add_argument("--logging_steps", type=int, default=50,
|
||||
help="Log every X updates steps.")
|
||||
parser.add_argument("--save_steps", type=int, default=50,
|
||||
help="Save checkpoint every X updates steps.")
|
||||
parser.add_argument("--eval_all_checkpoints", action="store_true",
|
||||
help="Evaluate all checkpoints starting with the same prefix as model_name and ending with step number")
|
||||
parser.add_argument("--no_cuda", action="store_true",
|
||||
help="Avoid using CUDA when available")
|
||||
parser.add_argument("--overwrite_output_dir", action="store_true",
|
||||
help="Overwrite the content of the output directory")
|
||||
parser.add_argument("--overwrite_cache", action="store_true",
|
||||
help="Overwrite the cached training and evaluation sets")
|
||||
parser.add_argument("--seed", type=int, default=42,
|
||||
help="random seed for initialization")
|
||||
|
||||
parser.add_argument("--fp16", action="store_true",
|
||||
help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
|
||||
parser.add_argument("--fp16_opt_level", type=str, default="O1",
|
||||
help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
|
||||
" See details at https://nvidia.github.io/apex/amp.html")
|
||||
parser.add_argument("--local_rank", type=int, default=-1,
|
||||
help="For distributed training: local_rank")
|
||||
parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
|
||||
parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
|
||||
args = parser.parse_args()
|
||||
|
||||
if os.path.exists(args.output_dir) and os.listdir(
|
||||
args.output_dir) and args.do_train and not args.overwrite_output_dir:
|
||||
raise ValueError(
|
||||
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
|
||||
args.output_dir))
|
||||
|
||||
# Setup distant debugging if needed
|
||||
if args.server_ip and args.server_port:
|
||||
# Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
|
||||
import ptvsd
|
||||
print("Waiting for debugger attach")
|
||||
ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
|
||||
ptvsd.wait_for_attach()
|
||||
|
||||
# Setup CUDA, GPU & distributed training
|
||||
if args.local_rank == -1 or args.no_cuda:
|
||||
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
|
||||
args.n_gpu = torch.cuda.device_count()
|
||||
else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
|
||||
torch.cuda.set_device(args.local_rank)
|
||||
device = torch.device("cuda", args.local_rank)
|
||||
torch.distributed.init_process_group(backend="nccl")
|
||||
args.n_gpu = 1
|
||||
args.device = device
|
||||
|
||||
# Setup logging
|
||||
logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
|
||||
datefmt="%m/%d/%Y %H:%M:%S",
|
||||
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
|
||||
logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
|
||||
args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
|
||||
|
||||
# Set seed
|
||||
set_seed(args)
|
||||
|
||||
# Prepare CONLL-2003 task
|
||||
labels = get_labels(args.labels)
|
||||
num_labels = len(labels)
|
||||
# Use cross entropy ignore index as padding label id so that only real label ids contribute to the loss later
|
||||
pad_token_label_id = CrossEntropyLoss().ignore_index
|
||||
|
||||
# Load pretrained model and tokenizer
|
||||
if args.local_rank not in [-1, 0]:
|
||||
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||
|
||||
args.model_type = args.model_type.lower()
|
||||
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
|
||||
config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path,
|
||||
num_labels=num_labels)
|
||||
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
|
||||
do_lower_case=args.do_lower_case)
|
||||
model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path),
|
||||
config=config)
|
||||
|
||||
if args.local_rank == 0:
|
||||
torch.distributed.barrier() # Make sure only the first process in distributed training will download model & vocab
|
||||
|
||||
model.to(args.device)
|
||||
|
||||
logger.info("Training/evaluation parameters %s", args)
|
||||
|
||||
# Training
|
||||
if args.do_train:
|
||||
train_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode="train")
|
||||
global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels, pad_token_label_id)
|
||||
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
|
||||
|
||||
# Saving best practices: if you use default names for the model, you can reload it using from_pretrained()
|
||||
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
|
||||
# Create output directory if needed
|
||||
if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
|
||||
os.makedirs(args.output_dir)
|
||||
|
||||
logger.info("Saving model checkpoint to %s", args.output_dir)
|
||||
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
|
||||
# They can then be reloaded using `from_pretrained()`
|
||||
model_to_save = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
|
||||
model_to_save.save_pretrained(args.output_dir)
|
||||
tokenizer.save_pretrained(args.output_dir)
|
||||
|
||||
# Good practice: save your training arguments together with the trained model
|
||||
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
|
||||
|
||||
# Evaluation
|
||||
results = {}
|
||||
if args.do_eval and args.local_rank in [-1, 0]:
|
||||
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
|
||||
checkpoints = [args.output_dir]
|
||||
if args.eval_all_checkpoints:
|
||||
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)))
|
||||
logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
|
||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||
for checkpoint in checkpoints:
|
||||
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
|
||||
model = model_class.from_pretrained(checkpoint)
|
||||
model.to(args.device)
|
||||
result, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev", prefix=global_step)
|
||||
if global_step:
|
||||
result = {"{}_{}".format(global_step, k): v for k, v in result.items()}
|
||||
results.update(result)
|
||||
output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
|
||||
with open(output_eval_file, "w") as writer:
|
||||
for key in sorted(results.keys()):
|
||||
writer.write("{} = {}\n".format(key, str(results[key])))
|
||||
|
||||
if args.do_predict and args.local_rank in [-1, 0]:
|
||||
tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
|
||||
model = model_class.from_pretrained(args.output_dir)
|
||||
model.to(args.device)
|
||||
result, predictions = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="test")
|
||||
# Save results
|
||||
output_test_results_file = os.path.join(args.output_dir, "test_results.txt")
|
||||
with open(output_test_results_file, "w") as writer:
|
||||
for key in sorted(result.keys()):
|
||||
writer.write("{} = {}\n".format(key, str(result[key])))
|
||||
# Save predictions
|
||||
output_test_predictions_file = os.path.join(args.output_dir, "test_predictions.txt")
|
||||
with open(output_test_predictions_file, "w") as writer:
|
||||
with open(os.path.join(args.data_dir, "test.txt"), "r") as f:
|
||||
example_id = 0
|
||||
for line in f:
|
||||
if line.startswith("-DOCSTART-") or line == "" or line == "\n":
|
||||
writer.write(line)
|
||||
if not predictions[example_id]:
|
||||
example_id += 1
|
||||
elif predictions[example_id]:
|
||||
output_line = line.split()[0] + " " + predictions[example_id].pop(0) + "\n"
|
||||
writer.write(output_line)
|
||||
else:
|
||||
logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
|
||||
|
||||
return results
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
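The evaluation code in `run_ner.py` above skips every position whose gold label equals `pad_token_label_id` before it computes precision, recall and F1. The following is a minimal sketch of that filtering step, not the script's own code: plain Python lists stand in for the NumPy arrays, `align_predictions` is a hypothetical helper name, and the label map and id matrices are made-up illustration data. The default of `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, which the script uses as the padding label id.

```python
def align_predictions(preds, out_label_ids, label_map, pad_token_label_id=-100):
    """Convert id matrices to per-sentence tag lists, skipping ignored positions."""
    out_label_list, preds_list = [], []
    for gold_row, pred_row in zip(out_label_ids, preds):
        gold_tags, pred_tags = [], []
        for gold_id, pred_id in zip(gold_row, pred_row):
            # Positions labelled with the padding id (padding, or sub-word
            # pieces that carry no label) do not count towards the metrics.
            if gold_id != pad_token_label_id:
                gold_tags.append(label_map[gold_id])
                pred_tags.append(label_map[pred_id])
        out_label_list.append(gold_tags)
        preds_list.append(pred_tags)
    return out_label_list, preds_list


# Made-up example data: two sentences of four positions each.
label_map = {0: "O", 1: "B-PER", 2: "I-PER"}
gold = [[0, 1, -100, 2], [-100, 0, 0, -100]]
pred = [[0, 1, 1, 0], [2, 0, 1, 0]]

gold_tags, pred_tags = align_predictions(pred, gold, label_map)
print(gold_tags)
print(pred_tags)
```

Because the filter is driven by the gold labels only, the two returned lists always have matching lengths per sentence, which is what sequence-level metric functions expect.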
@@ -138,8 +138,8 @@ def train(args, train_dataset, model, tokenizer):
|
||||
model.train()
|
||||
batch = tuple(t.to(args.device) for t in batch)
|
||||
inputs = {'input_ids': batch[0],
|
||||
'attention_mask': batch[1],
|
||||
'start_positions': batch[3],
|
||||
'attention_mask': batch[1],
|
||||
'start_positions': batch[3],
|
||||
'end_positions': batch[4]}
|
||||
if args.model_type != 'distilbert':
|
||||
inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]
|
||||
@@ -157,13 +157,16 @@ def train(args, train_dataset, model, tokenizer):
|
||||
if args.fp16:
|
||||
with amp.scale_loss(loss, optimizer) as scaled_loss:
|
||||
scaled_loss.backward()
|
||||
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||
else:
|
||||
loss.backward()
|
||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||
|
||||
tr_loss += loss.item()
|
||||
if (step + 1) % args.gradient_accumulation_steps == 0:
|
||||
if args.fp16:
|
||||
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
|
||||
else:
|
||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||
|
||||
optimizer.step()
|
||||
scheduler.step() # Update learning rate schedule
|
||||
model.zero_grad()
|
||||
@@ -485,6 +488,16 @@ def main():
|
||||
|
||||
logger.info("Training/evaluation parameters %s", args)
|
||||
|
||||
# Before we do anything with models, we want to ensure that we get fp16 execution of torch.einsum if args.fp16 is set.
|
||||
# Otherwise it'll default to "promote" mode, and we'll get fp32 operations. Note that running `--fp16_opt_level="O2"` will
|
||||
# remove the need for this code, but it is still valid.
|
||||
if args.fp16:
|
||||
try:
|
||||
import apex
|
||||
apex.amp.register_half_function(torch, 'einsum')
|
||||
except ImportError:
|
||||
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
|
||||
|
||||
# Training
|
||||
if args.do_train:
|
||||
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
|
||||
|
||||
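Judging by the line counts in its header, the `-157,13 +157,16` hunk above moves gradient clipping from directly after each `backward()` call to just before `optimizer.step()`. The sketch below illustrates why that matters when gradients are accumulated over several micro-batches. It is a pure-Python simulation under stated assumptions, not PyTorch code: `clip_by_norm` and `accumulate` are hypothetical helpers, the gradients are made-up two-dimensional vectors, and the small `eps` term only mimics the usual clip-by-norm rule.

```python
import math


def clip_by_norm(grad, max_norm, eps=1e-6):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grad]


def accumulate(micro_grads, max_norm, clip_every_backward):
    """Sum micro-batch gradients, optionally clipping the running sum each time."""
    acc = [0.0] * len(micro_grads[0])
    for grad in micro_grads:
        acc = [a + g for a, g in zip(acc, grad)]
        if clip_every_backward:
            # Clipping the running sum shrinks earlier micro-batches
            # before later ones are added.
            acc = clip_by_norm(acc, max_norm)
    # Clip once more, right before the (simulated) optimizer update.
    return clip_by_norm(acc, max_norm)


# Made-up gradients of two micro-batches.
micro_grads = [[3.0, 4.0], [4.0, -3.0]]
per_backward = accumulate(micro_grads, 1.0, clip_every_backward=True)
at_update = accumulate(micro_grads, 1.0, clip_every_backward=False)
print(per_backward)
print(at_update)
```

Both variants end up with a gradient of norm close to 1, but the directions differ: clipping after every backward pass down-weights the first micro-batch, whereas clipping once at the update keeps all micro-batches equally weighted.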
488
examples/run_summarization_finetuning.py
Normal file
@@ -0,0 +1,488 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2019 The HuggingFace Inc. team.
|
||||
# Copyright (c) 2019 The HuggingFace Inc. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Finetuning seq2seq models for sequence generation."""
|
||||
|
||||
import argparse
|
||||
import functools
|
||||
import logging
|
||||
import os
|
||||
import random
|
||||
import sys
|
||||
|
||||
import numpy as np
|
||||
from tqdm import tqdm, trange
|
||||
import torch
|
||||
from torch.optim import Adam
|
||||
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
|
||||
|
||||
from transformers import (
|
||||
AutoTokenizer,
|
||||
BertForMaskedLM,
|
||||
BertConfig,
|
||||
PreTrainedEncoderDecoder,
|
||||
Model2Model,
|
||||
)
|
||||
|
||||
from utils_summarization import (
|
||||
CNNDailyMailDataset,
|
||||
encode_for_summarization,
|
||||
fit_to_block_size,
|
||||
build_lm_labels,
|
||||
build_mask,
|
||||
compute_token_type_ids,
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
|
||||
|
||||
|
||||
def set_seed(args):
|
||||
random.seed(args.seed)
|
||||
np.random.seed(args.seed)
|
||||
torch.manual_seed(args.seed)
|
||||
|
||||
|
||||
# ------------
|
||||
# Load dataset
|
||||
# ------------
|
||||
|
||||
|
||||
def load_and_cache_examples(args, tokenizer):
|
||||
dataset = CNNDailyMailDataset(tokenizer, data_dir=args.data_dir)
|
||||
return dataset
|
||||
|
||||
|
||||
def collate(data, tokenizer, block_size):
|
||||
""" List of tuple as an input. """
|
||||
# remove the files with an empty story or summary, encode and fit to block
|
||||
data = filter(lambda x: not (len(x[0]) == 0 or len(x[1]) == 0), data)
|
||||
data = [
|
||||
encode_for_summarization(story, summary, tokenizer) for story, summary in data
|
||||
]
|
||||
data = [
|
||||
(
|
||||
fit_to_block_size(story, block_size, tokenizer.pad_token_id),
|
||||
fit_to_block_size(summary, block_size, tokenizer.pad_token_id),
|
||||
)
|
||||
for story, summary in data
|
||||
]
|
||||
|
||||
stories = torch.tensor([story for story, summary in data])
|
||||
summaries = torch.tensor([summary for story, summary in data])
|
||||
encoder_token_type_ids = compute_token_type_ids(stories, tokenizer.cls_token_id)
|
||||
encoder_mask = build_mask(stories, tokenizer.pad_token_id)
|
||||
decoder_mask = build_mask(summaries, tokenizer.pad_token_id)
|
||||
lm_labels = build_lm_labels(summaries, tokenizer.pad_token_id)
|
||||
|
||||
return (
|
||||
stories,
|
||||
summaries,
|
||||
encoder_token_type_ids,
|
||||
encoder_mask,
|
||||
decoder_mask,
|
||||
lm_labels,
|
||||
)
|
||||
|
||||
|
||||
# ----------
|
||||
# Optimizers
|
||||
# ----------
|
||||
|
||||
|
||||
class BertSumOptimizer(object):
|
||||
""" Specific optimizer for BertSum.
|
||||
|
||||
As described in [1], the authors fine-tune BertSum for abstractive
|
||||
summarization using two Adam Optimizers with different warm-up steps and
|
||||
learning rate. They also use a custom learning rate scheduler.
|
||||
|
||||
[1] Liu, Yang, and Mirella Lapata. "Text summarization with pretrained encoders."
|
||||
arXiv preprint arXiv:1908.08345 (2019).
|
||||
"""
|
||||
|
||||
def __init__(self, model, lr, warmup_steps, beta_1=0.99, beta_2=0.999, eps=1e-8):
|
||||
self.encoder = model.encoder
|
||||
self.decoder = model.decoder
|
||||
self.lr = lr
|
||||
self.warmup_steps = warmup_steps
|
||||
|
||||
self.optimizers = {
|
||||
"encoder": Adam(
|
||||
model.encoder.parameters(),
|
||||
lr=lr["encoder"],
|
||||
betas=(beta_1, beta_2),
|
||||
eps=eps,
|
||||
),
|
||||
"decoder": Adam(
|
||||
model.decoder.parameters(),
|
||||
lr=lr["decoder"],
|
||||
betas=(beta_1, beta_2),
|
||||
eps=eps,
|
||||
),
|
||||
}
|
||||
|
||||
self._step = 0
|
||||
|
||||
def _update_rate(self, stack):
|
||||
return self.lr[stack] * min(
|
||||
self._step ** (-0.5), self._step * self.warmup_steps[stack] ** (-1.5)
|
||||
)
|
||||
|
||||
def zero_grad(self):
|
||||
self.optimizers["decoder"].zero_grad()
self.optimizers["encoder"].zero_grad()
|
||||
|
||||
def step(self):
|
||||
self._step += 1
|
||||
for stack, optimizer in self.optimizers.items():
|
||||
new_rate = self._update_rate(stack)
|
||||
for param_group in optimizer.param_groups:
|
||||
param_group["lr"] = new_rate
|
||||
optimizer.step()
|
||||
|
||||
|
||||
# ------------
|
||||
# Train
|
||||
# ------------
|
||||
|
||||
|
||||
def train(args, model, tokenizer):
|
||||
""" Fine-tune the pretrained model on the corpus. """
|
||||
set_seed(args)
|
||||
|
||||
# Load the data
|
||||
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
|
||||
train_dataset = load_and_cache_examples(args, tokenizer)
|
||||
train_sampler = RandomSampler(train_dataset)
|
||||
model_collate_fn = functools.partial(collate, tokenizer=tokenizer, block_size=512)
|
||||
train_dataloader = DataLoader(
|
||||
train_dataset,
|
||||
sampler=train_sampler,
|
||||
batch_size=args.train_batch_size,
|
||||
collate_fn=model_collate_fn,
|
||||
)
|
||||
|
||||
# Training schedule
|
||||
if args.max_steps > 0:
|
||||
t_total = args.max_steps
|
||||
args.num_train_epochs = t_total // (
|
||||
len(train_dataloader) // args.gradient_accumulation_steps + 1
|
||||
)
|
||||
else:
|
||||
t_total = (
|
||||
len(train_dataloader)
|
||||
// args.gradient_accumulation_steps
|
||||
* args.num_train_epochs
|
||||
)
|
||||
|
||||
# Prepare the optimizer
|
||||
lr = {"encoder": 0.002, "decoder": 0.2}
|
||||
warmup_steps = {"encoder": 20000, "decoder": 10000}
|
||||
optimizer = BertSumOptimizer(model, lr, warmup_steps)
|
||||
|
||||
# Train
|
||||
logger.info("***** Running training *****")
|
||||
logger.info(" Num examples = %d", len(train_dataset))
|
||||
logger.info(" Num Epochs = %d", args.num_train_epochs)
|
||||
logger.info(
|
||||
" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size
|
||||
)
|
||||
logger.info(
|
||||
" Total train batch size (w. parallel, distributed & accumulation) = %d",
|
||||
args.train_batch_size * args.gradient_accumulation_steps
|
||||
# * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
|
||||
)
|
||||
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
|
||||
logger.info(" Total optimization steps = %d", t_total)
|
||||
|
||||
model.zero_grad()
|
||||
train_iterator = trange(args.num_train_epochs, desc="Epoch", disable=True)
|
||||
|
||||
global_step = 0
|
||||
tr_loss = 0.0
|
||||
for _ in train_iterator:
|
||||
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=True)
|
||||
for step, batch in enumerate(epoch_iterator):
|
||||
source, target, encoder_token_type_ids, encoder_mask, decoder_mask, lm_labels = batch
|
||||
|
||||
source = source.to(args.device)
|
||||
target = target.to(args.device)
|
||||
encoder_token_type_ids = encoder_token_type_ids.to(args.device)
|
||||
encoder_mask = encoder_mask.to(args.device)
|
||||
decoder_mask = decoder_mask.to(args.device)
|
||||
lm_labels = lm_labels.to(args.device)
|
||||
|
||||
model.train()
|
||||
outputs = model(
|
||||
source,
|
||||
target,
|
||||
encoder_token_type_ids=encoder_token_type_ids,
|
||||
encoder_attention_mask=encoder_mask,
|
||||
decoder_attention_mask=decoder_mask,
|
||||
decoder_lm_labels=lm_labels,
|
||||
)
|
||||
|
||||
loss = outputs[0]
|
||||
print(loss)
|
||||
if args.gradient_accumulation_steps > 1:
|
||||
loss /= args.gradient_accumulation_steps
|
||||
|
||||
loss.backward()
|
||||
|
||||
tr_loss += loss.item()
|
||||
if (step + 1) % args.gradient_accumulation_steps == 0:
|
||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
|
||||
optimizer.step()
|
||||
model.zero_grad()
|
||||
global_step += 1
|
||||
|
||||
if args.max_steps > 0 and global_step > args.max_steps:
|
||||
epoch_iterator.close()
|
||||
break
|
||||
|
||||
if args.max_steps > 0 and global_step > args.max_steps:
|
||||
train_iterator.close()
|
||||
break
|
||||
|
||||
return global_step, tr_loss / global_step
|
||||
|
||||
|
||||
# ------------
|
||||
# Evaluate
|
||||
# ------------
|
||||
|
||||
|
||||
def evaluate(args, model, tokenizer, prefix=""):
|
||||
set_seed(args)
|
||||
|
||||
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
|
||||
eval_dataset = load_and_cache_examples(args, tokenizer)
|
||||
eval_sampler = SequentialSampler(eval_dataset)
|
||||
eval_dataloader = DataLoader(
|
||||
eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size
|
||||
)
|
||||
|
||||
logger.info("***** Running evaluation {} *****".format(prefix))
|
||||
logger.info(" Num examples = %d", len(eval_dataset))
|
||||
logger.info(" Batch size = %d", args.eval_batch_size)
|
||||
eval_loss = 0.0
|
||||
nb_eval_steps = 0
|
||||
model.eval()
|
||||
|
||||
for batch in tqdm(eval_dataloader, desc="Evaluating"):
|
||||
source, target, encoder_token_type_ids, encoder_mask, decoder_mask, lm_labels = batch
|
||||
|
||||
source = source.to(args.device)
|
||||
target = target.to(args.device)
|
||||
encoder_token_type_ids = encoder_token_type_ids.to(args.device)
|
||||
encoder_mask = encoder_mask.to(args.device)
|
||||
decoder_mask = decoder_mask.to(args.device)
|
||||
lm_labels = lm_labels.to(args.device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(
|
||||
source,
|
||||
target,
|
||||
encoder_token_type_ids=encoder_token_type_ids,
|
||||
encoder_attention_mask=encoder_mask,
|
||||
decoder_attention_mask=decoder_mask,
|
||||
decoder_lm_labels=lm_labels,
|
||||
)
|
||||
lm_loss = outputs[0]
|
||||
eval_loss += lm_loss.mean().item()
|
||||
nb_eval_steps += 1
|
||||
|
||||
eval_loss = eval_loss / nb_eval_steps
|
||||
perplexity = torch.exp(torch.tensor(eval_loss))
|
||||
|
||||
result = {"perplexity": perplexity}
|
||||
|
||||
# Save the evaluation's results
|
||||
output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
|
||||
if not os.path.exists(args.output_dir):
|
||||
os.makedirs(args.output_dir)
|
||||
|
||||
with open(output_eval_file, "w") as writer:
|
||||
logger.info("***** Eval results {} *****".format(prefix))
|
||||
for key in sorted(result.keys()):
|
||||
logger.info(" %s = %s", key, str(result[key]))
|
||||
writer.write("%s = %s\n" % (key, str(result[key])))
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
# Required parameters
|
||||
parser.add_argument(
|
||||
"--data_dir",
|
||||
default=None,
|
||||
type=str,
|
||||
required=True,
|
||||
help="The input data directory (containing the story files).",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output_dir",
|
||||
default=None,
|
||||
type=str,
|
||||
required=True,
|
||||
help="The output directory where the model predictions and checkpoints will be written.",
|
||||
)
|
||||
|
||||
# Optional parameters
|
||||
parser.add_argument(
|
||||
"--gradient_accumulation_steps",
|
||||
type=int,
|
||||
default=1,
|
||||
help="Number of updates steps to accumulate before performing a backward/update pass.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--do_evaluate",
|
||||
type=bool,
|
||||
default=False,
|
||||
help="Run model evaluation on out-of-sample data.",
|
||||
)
|
||||
parser.add_argument("--do_train", type=bool, default=False, help="Run training.")
|
||||
parser.add_argument(
|
||||
"--do_overwrite_output_dir",
|
||||
type=bool,
|
||||
default=False,
|
||||
help="Whether to overwrite the output dir.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--model_name_or_path",
|
||||
default="bert-base-cased",
|
||||
type=str,
|
||||
help="The model checkpoint to initialize the encoder and decoder's weights with.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--model_type",
|
||||
default="bert",
|
||||
type=str,
|
||||
help="The decoder architecture to be fine-tuned.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_grad_norm", default=1.0, type=float, help="Max gradient norm."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_steps",
|
||||
default=-1,
|
||||
type=int,
|
||||
help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--to_cpu", default=False, type=bool, help="Whether to force training on CPU."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_train_epochs",
|
||||
default=10,
|
||||
type=int,
|
||||
help="Total number of training epochs to perform.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--per_gpu_train_batch_size",
|
||||
default=4,
|
||||
type=int,
|
||||
help="Batch size per GPU/CPU for training.",
|
||||
)
|
||||
parser.add_argument("--seed", default=42, type=int)
|
||||
args = parser.parse_args()
|
||||
|
||||
if (
|
||||
os.path.exists(args.output_dir)
|
||||
and os.listdir(args.output_dir)
|
||||
and args.do_train
|
||||
and not args.do_overwrite_output_dir
|
||||
):
|
||||
raise ValueError(
|
||||
"Output directory ({}) already exists and is not empty. Use --do_overwrite_output_dir to overwrite.".format(
|
||||
args.output_dir
|
||||
)
|
||||
)
|
||||
|
||||
# Set up training device
|
||||
if args.to_cpu or not torch.cuda.is_available():
|
||||
args.device = torch.device("cpu")
|
||||
args.n_gpu = 0
|
||||
else:
|
||||
args.device = torch.device("cuda")
|
||||
args.n_gpu = torch.cuda.device_count()
|
||||
|
||||
# Load pretrained model and tokenizer. The decoder's weights are randomly initialized.
|
||||
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
|
||||
config = BertConfig.from_pretrained(args.model_name_or_path)
|
||||
decoder_model = BertForMaskedLM(config)
|
||||
model = Model2Model.from_pretrained(
|
||||
args.model_name_or_path, decoder_model=decoder_model
|
||||
)
|
||||
|
||||
# Setup logging
|
||||
logging.basicConfig(
|
||||
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
|
||||
datefmt="%m/%d/%Y %H:%M:%S",
|
||||
level=logging.INFO,
|
||||
)
|
||||
logger.warning(
|
||||
"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
|
||||
0,
|
||||
args.device,
|
||||
args.n_gpu,
|
||||
False,
|
||||
False,
|
||||
)
|
||||
|
||||
logger.info("Training/evaluation parameters %s", args)
|
||||
|
||||
# Train the model
|
||||
model.to(args.device)
|
||||
if args.do_train:
|
||||
global_step, tr_loss = train(args, model, tokenizer)
|
||||
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
|
||||
|
||||
if not os.path.exists(args.output_dir):
|
||||
os.makedirs(args.output_dir)
|
||||
|
||||
logger.info("Saving model checkpoint to %s", args.output_dir)
|
||||
|
||||
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
|
||||
# They can then be reloaded using `from_pretrained()`
|
||||
model_to_save = (
|
||||
model.module if hasattr(model, "module") else model
|
||||
) # Take care of distributed/parallel training
|
||||
model_to_save.save_pretrained(args.output_dir)
|
||||
tokenizer.save_pretrained(args.output_dir)
|
||||
torch.save(args, os.path.join(args.output_dir, "training_arguments.bin"))
|
||||
|
||||
# Evaluate the model
|
||||
results = {}
|
||||
if args.do_evaluate:
|
||||
checkpoints = []
|
||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||
for checkpoint in checkpoints:
|
||||
encoder_checkpoint = os.path.join(checkpoint, "encoder")
|
||||
decoder_checkpoint = os.path.join(checkpoint, "decoder")
|
||||
model = PreTrainedEncoderDecoder.from_pretrained(
|
||||
encoder_checkpoint, decoder_checkpoint
|
||||
)
|
||||
model.to(args.device)
|
||||
results = "placeholder"
|
||||
|
||||
return results
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
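The `model.module if hasattr(model, "module") else model` idiom in the saving step unwraps a model that may have been wrapped for multi-GPU training (wrappers such as `torch.nn.DataParallel` keep the original model in a `.module` attribute). A minimal sketch of the idea, using hypothetical stand-in classes (`TinyModel`, `ParallelWrapper` and `unwrap` are illustrative names, not part of the script):

```python
class TinyModel:
    """Hypothetical stand-in for a model that exposes `save_pretrained`."""

    def __init__(self):
        self.saved_to = None

    def save_pretrained(self, output_dir):
        # A real model would write weights and config; here we only record the path.
        self.saved_to = output_dir


class ParallelWrapper:
    """Hypothetical stand-in for a parallel wrapper that stores the model in `.module`."""

    def __init__(self, module):
        self.module = module


def unwrap(model):
    """Return the underlying model whether or not it has been wrapped."""
    return model.module if hasattr(model, "module") else model


plain = TinyModel()
wrapped = ParallelWrapper(TinyModel())

unwrap(plain).save_pretrained("out_a")
unwrap(wrapped).save_pretrained("out_b")
```

Saving through `unwrap` means the checkpoint is written by the inner model's own `save_pretrained`, so it can be reloaded with `from_pretrained()` regardless of how training was parallelized.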
@@ -1,40 +1,91 @@
|
||||
import os
|
||||
import tensorflow as tf
|
||||
import tensorflow_datasets
|
||||
from transformers import BertTokenizer, TFBertForSequenceClassification, glue_convert_examples_to_features, BertForSequenceClassification
|
||||
from transformers import BertTokenizer, TFBertForSequenceClassification, BertConfig, glue_convert_examples_to_features, BertForSequenceClassification, glue_processors
|
||||
|
||||
# Load dataset, tokenizer, model from pretrained model/vocabulary
|
||||
# script parameters
|
||||
BATCH_SIZE = 32
|
||||
EVAL_BATCH_SIZE = BATCH_SIZE * 2
|
||||
USE_XLA = False
|
||||
USE_AMP = False
|
||||
EPOCHS = 3
|
||||
|
||||
TASK = "mrpc"
|
||||
|
||||
if TASK == "sst-2":
|
||||
TFDS_TASK = "sst2"
|
||||
elif TASK == "sts-b":
|
||||
TFDS_TASK = "stsb"
|
||||
else:
|
||||
TFDS_TASK = TASK
|
||||
|
||||
num_labels = len(glue_processors[TASK]().get_labels())
|
||||
print(num_labels)
|
||||
|
||||
tf.config.optimizer.set_jit(USE_XLA)
|
||||
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})
|
||||
|
||||
# Load tokenizer and model from pretrained model/vocabulary. Specify the number of labels to classify (2+: classification, 1: regression)
|
||||
config = BertConfig.from_pretrained("bert-base-cased", num_labels=num_labels)
|
||||
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
|
||||
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
|
||||
data = tensorflow_datasets.load('glue/mrpc')
|
||||
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', config=config)
|
||||
|
||||
# Load dataset via TensorFlow Datasets
|
||||
data, info = tensorflow_datasets.load(f'glue/{TFDS_TASK}', with_info=True)
|
||||
train_examples = info.splits['train'].num_examples
|
||||
|
||||
# MNLI expects either validation_matched or validation_mismatched
|
||||
valid_examples = info.splits['validation'].num_examples
|
||||
|
||||
# Prepare dataset for GLUE as a tf.data.Dataset instance
|
||||
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, 128, 'mrpc')
|
||||
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, 128, 'mrpc')
|
||||
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
|
||||
valid_dataset = valid_dataset.batch(64)
|
||||
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, 128, TASK)
|
||||
|
||||
# MNLI expects either validation_matched or validation_mismatched
|
||||
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, 128, TASK)
|
||||
train_dataset = train_dataset.shuffle(128).batch(BATCH_SIZE).repeat(-1)
|
||||
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
|
||||
|
||||
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
|
||||
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
|
||||
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
|
||||
opt = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
|
||||
if USE_AMP:
|
||||
# loss scaling is currently required when using mixed precision
|
||||
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, 'dynamic')
|
||||
|
||||
|
||||
if num_labels == 1:
|
||||
loss = tf.keras.losses.MeanSquaredError()
|
||||
else:
|
||||
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
|
||||
|
||||
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
|
||||
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
|
||||
model.compile(optimizer=opt, loss=loss, metrics=[metric])
|
||||
|
||||
# Train and evaluate using tf.keras.Model.fit()
|
||||
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
|
||||
validation_data=valid_dataset, validation_steps=7)
|
||||
train_steps = train_examples//BATCH_SIZE
|
||||
valid_steps = valid_examples//EVAL_BATCH_SIZE
|
||||
|
||||
# Load the TensorFlow model in PyTorch for inspection
|
||||
history = model.fit(train_dataset, epochs=EPOCHS, steps_per_epoch=train_steps,
|
||||
validation_data=valid_dataset, validation_steps=valid_steps)
|
||||
|
||||
# Save TF2 model
|
||||
os.makedirs('./save/', exist_ok=True)
|
||||
model.save_pretrained('./save/')
|
||||
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)
|
||||
|
||||
# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
|
||||
sentence_0 = "This research was consistent with his findings."
|
||||
sentence_1 = "His findings were compatible with this research."
|
||||
sentence_2 = "His findings were not compatible with this research."
|
||||
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
|
||||
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
|
||||
if TASK == "mrpc":
|
||||
# Load the TensorFlow model in PyTorch for inspection
|
||||
pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)
|
||||
|
||||
pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
|
||||
pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
|
||||
print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
|
||||
print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
|
||||
# Quickly test a few predictions - MRPC is a paraphrasing task, let's see if our model learned the task
|
||||
sentence_0 = 'This research was consistent with his findings.'
|
||||
sentence_1 = 'His findings were compatible with this research.'
|
||||
sentence_2 = 'His findings were not compatible with this research.'
|
||||
inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
|
||||
inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
|
||||
|
||||
del inputs_1["special_tokens_mask"]
|
||||
del inputs_2["special_tokens_mask"]
|
||||
|
||||
pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
|
||||
pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
|
||||
print('sentence_1 is', 'a paraphrase' if pred_1 else 'not a paraphrase', 'of sentence_0')
|
||||
print('sentence_2 is', 'a paraphrase' if pred_2 else 'not a paraphrase', 'of sentence_0')
|
||||
|
||||
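The diff above replaces the hard-coded `steps_per_epoch=115` and `validation_steps=7` with values derived from the dataset size by floor division. The sketch below compares floor and ceiling division; the split sizes are an assumption (the commonly reported GLUE/MRPC sizes), whereas the script reads them from `info.splits[...]`.

```python
import math

BATCH_SIZE = 32
EVAL_BATCH_SIZE = BATCH_SIZE * 2

# Assumed split sizes for GLUE/MRPC; the script obtains them from TensorFlow Datasets.
train_examples = 3668
valid_examples = 408

# Floor division, as in the new version of the script: a trailing partial batch is not counted.
train_steps = train_examples // BATCH_SIZE
valid_steps = valid_examples // EVAL_BATCH_SIZE

# Ceiling division counts the trailing partial batch as one more step.
train_steps_ceil = math.ceil(train_examples / BATCH_SIZE)
valid_steps_ceil = math.ceil(valid_examples / EVAL_BATCH_SIZE)

print(train_steps, valid_steps, train_steps_ceil, valid_steps_ceil)
```

Under these assumed sizes, the ceiling values match the previously hard-coded 115 and 7, while floor division gives one step fewer for each, so the last partial validation batch is not evaluated.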
212
examples/utils_ner.py
Normal file
@@ -0,0 +1,212 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
|
||||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Named entity recognition fine-tuning: utilities to work with CoNLL-2003 task. """
|
||||
|
||||
from __future__ import absolute_import, division, print_function
|
||||
|
||||
import logging
|
||||
import os
|
||||
from io import open
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class InputExample(object):
|
||||
"""A single training/test example for token classification."""
|
||||
|
||||
def __init__(self, guid, words, labels):
"""Constructs an InputExample.
|
||||
|
||||
Args:
|
||||
guid: Unique id for the example.
|
||||
words: list. The words of the sequence.
|
||||
labels: (Optional) list. The labels for each word of the sequence. This should be
|
||||
specified for train and dev examples, but not for test examples.
|
||||
"""
|
||||
self.guid = guid
|
||||
self.words = words
|
||||
self.labels = labels
|
||||
|
||||
|
||||
class InputFeatures(object):
|
||||
"""A single set of features of data."""
|
||||
|
||||
def __init__(self, input_ids, input_mask, segment_ids, label_ids):
|
||||
self.input_ids = input_ids
|
||||
self.input_mask = input_mask
|
||||
self.segment_ids = segment_ids
|
||||
self.label_ids = label_ids
|
||||
|
||||
|
||||
def read_examples_from_file(data_dir, mode):
    file_path = os.path.join(data_dir, "{}.txt".format(mode))
    guid_index = 1
    examples = []
    with open(file_path, encoding="utf-8") as f:
        words = []
        labels = []
        for line in f:
            if line.startswith("-DOCSTART-") or line == "" or line == "\n":
                if words:
                    examples.append(InputExample(guid="{}-{}".format(mode, guid_index),
                                                 words=words,
                                                 labels=labels))
                    guid_index += 1
                    words = []
                    labels = []
            else:
                splits = line.split(" ")
                words.append(splits[0])
                if len(splits) > 1:
                    labels.append(splits[-1].replace("\n", ""))
                else:
                    # Examples could have no label for mode = "test"
                    labels.append("O")
        if words:
            # Flush the last sentence when the file does not end with a blank line
            examples.append(InputExample(guid="{}-{}".format(mode, guid_index),
                                         words=words,
                                         labels=labels))
    return examples
|
||||
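The reader above groups whitespace-separated CoNLL-style lines into sentences, using blank lines and `-DOCSTART-` markers as boundaries. A minimal sketch of the same grouping on an in-memory string (`parse_conll` and the sample text are illustrative, not part of `utils_ner.py`):

```python
def parse_conll(text):
    """Group CoNLL-style lines into (words, labels) pairs, one pair per sentence."""
    sentences = []
    words, labels = [], []
    for line in text.splitlines():
        if line.startswith("-DOCSTART-") or line.strip() == "":
            # Sentence boundary: flush the sentence collected so far, if any.
            if words:
                sentences.append((words, labels))
                words, labels = [], []
        else:
            splits = line.split(" ")
            words.append(splits[0])
            # The label is the last column; unlabeled lines default to "O".
            labels.append(splits[-1] if len(splits) > 1 else "O")
    if words:
        sentences.append((words, labels))
    return sentences


sample = (
    "-DOCSTART- -X- -X- O\n"
    "\n"
    "EU NNP B-NP B-ORG\n"
    "rejects VBZ B-VP O\n"
    "\n"
    "Peter NNP B-NP B-PER\n"
    "Blackburn NNP I-NP I-PER\n"
)
sentences = parse_conll(sample)
```

The final `if words:` flush matters for files that do not end with a blank line, as in the sample.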
|
||||
|
||||
def convert_examples_to_features(examples,
|
||||
label_list,
|
||||
max_seq_length,
|
||||
tokenizer,
|
||||
cls_token_at_end=False,
|
||||
cls_token="[CLS]",
|
||||
cls_token_segment_id=1,
|
||||
sep_token="[SEP]",
|
||||
sep_token_extra=False,
|
||||
pad_on_left=False,
|
||||
pad_token=0,
|
||||
pad_token_segment_id=0,
|
||||
pad_token_label_id=-1,
|
||||
sequence_a_segment_id=0,
|
||||
mask_padding_with_zero=True):
""" Converts a list of examples into a list of `InputFeatures`.
`cls_token_at_end` defines the location of the CLS token:
- False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
- True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
`cls_token_segment_id` defines the segment id associated with the CLS token (0 for BERT, 2 for XLNet)
|
||||
"""
|
||||
|
||||
label_map = {label: i for i, label in enumerate(label_list)}
|
||||
|
||||
features = []
|
||||
for (ex_index, example) in enumerate(examples):
|
||||
if ex_index % 10000 == 0:
|
||||
logger.info("Writing example %d of %d", ex_index, len(examples))
|
||||
|
||||
tokens = []
|
||||
label_ids = []
|
||||
for word, label in zip(example.words, example.labels):
|
||||
word_tokens = tokenizer.tokenize(word)
|
||||
tokens.extend(word_tokens)
|
||||
# Use the real label id for the first token of the word, and padding ids for the remaining tokens
|
||||
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
|
||||
|
||||
# Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
|
||||
special_tokens_count = 3 if sep_token_extra else 2
|
||||
if len(tokens) > max_seq_length - special_tokens_count:
|
||||
tokens = tokens[:(max_seq_length - special_tokens_count)]
|
||||
label_ids = label_ids[:(max_seq_length - special_tokens_count)]
|
||||
|
||||
# The convention in BERT is:
|
||||
# (a) For sequence pairs:
|
||||
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
|
||||
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
|
||||
# (b) For single sequences:
|
||||
# tokens: [CLS] the dog is hairy . [SEP]
|
||||
# type_ids: 0 0 0 0 0 0 0
|
||||
#
|
||||
# Where "type_ids" are used to indicate whether this is the first
|
||||
# sequence or the second sequence. The embedding vectors for `type=0` and
|
||||
# `type=1` were learned during pre-training and are added to the wordpiece
|
||||
# embedding vector (and position vector). This is not *strictly* necessary
|
||||
# since the [SEP] token unambiguously separates the sequences, but it makes
|
||||
# it easier for the model to learn the concept of sequences.
|
||||
#
|
||||
# For classification tasks, the first vector (corresponding to [CLS]) is
|
||||
# used as the "sentence vector". Note that this only makes sense because
|
||||
# the entire model is fine-tuned.
|
||||
tokens += [sep_token]
|
||||
label_ids += [pad_token_label_id]
|
||||
if sep_token_extra:
|
||||
# roberta uses an extra separator b/w pairs of sentences
|
||||
tokens += [sep_token]
|
||||
label_ids += [pad_token_label_id]
|
||||
segment_ids = [sequence_a_segment_id] * len(tokens)
|
||||
|
||||
if cls_token_at_end:
|
||||
tokens += [cls_token]
|
||||
label_ids += [pad_token_label_id]
|
||||
segment_ids += [cls_token_segment_id]
|
||||
else:
|
||||
tokens = [cls_token] + tokens
|
||||
label_ids = [pad_token_label_id] + label_ids
|
||||
segment_ids = [cls_token_segment_id] + segment_ids
|
||||
|
||||
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||
|
||||
# The mask has 1 for real tokens and 0 for padding tokens. Only real
|
||||
# tokens are attended to.
|
||||
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
|
||||
|
||||
# Zero-pad up to the sequence length.
|
||||
padding_length = max_seq_length - len(input_ids)
|
||||
if pad_on_left:
|
||||
input_ids = ([pad_token] * padding_length) + input_ids
|
||||
input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
|
||||
segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
|
||||
label_ids = ([pad_token_label_id] * padding_length) + label_ids
|
||||
else:
|
||||
input_ids += ([pad_token] * padding_length)
|
||||
input_mask += ([0 if mask_padding_with_zero else 1] * padding_length)
|
||||
segment_ids += ([pad_token_segment_id] * padding_length)
|
||||
label_ids += ([pad_token_label_id] * padding_length)
|
||||
|
||||
assert len(input_ids) == max_seq_length
|
||||
assert len(input_mask) == max_seq_length
|
||||
assert len(segment_ids) == max_seq_length
|
||||
assert len(label_ids) == max_seq_length
|
||||
|
||||
if ex_index < 5:
|
||||
logger.info("*** Example ***")
|
||||
logger.info("guid: %s", example.guid)
|
||||
logger.info("tokens: %s", " ".join([str(x) for x in tokens]))
|
||||
logger.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
|
||||
logger.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
|
||||
logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
|
||||
logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))
|
||||
|
||||
features.append(
|
||||
InputFeatures(input_ids=input_ids,
|
||||
input_mask=input_mask,
|
||||
segment_ids=segment_ids,
|
||||
label_ids=label_ids))
|
||||
return features
|
||||
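The core of the conversion above is label alignment: each word may be split into several subword tokens, the first token keeps the word's label id, and the remaining tokens receive `pad_token_label_id`. A minimal sketch, assuming a toy tokenizer (`toy_tokenize` and `align_labels` are hypothetical names; a real tokenizer would produce different pieces):

```python
def toy_tokenize(word):
    """Hypothetical subword tokenizer: 4-character pieces, continuations prefixed with '##'."""
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align_labels(words, labels, label_map, pad_token_label_id=-1):
    """Give the real label id to the first subword of each word, a padding id to the rest."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = toy_tokenize(word)
        tokens.extend(word_tokens)
        label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
    return tokens, label_ids


label_map = {"O": 0, "B-LOC": 1}
tokens, label_ids = align_labels(["in", "Jacksonville"], ["O", "B-LOC"], label_map)
print(tokens)
print(label_ids)
```

Keeping `tokens` and `label_ids` the same length is what allows both to be truncated and padded together later in the function.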
|
||||
|
||||
def get_labels(path):
    if path:
        with open(path, "r") as f:
            labels = f.read().splitlines()
        if "O" not in labels:
            labels = ["O"] + labels
        return labels
    else:
        return ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
|
||||
184
examples/utils_summarization.py
Normal file
@@ -0,0 +1,184 @@
|
||||
from collections import deque
import os

import torch
from torch.utils.data import Dataset


# ------------
# Data loading
# ------------
|
||||
|
||||
|
||||
class CNNDailyMailDataset(Dataset):
    """ Abstracts the dataset used to train seq2seq models.

    CNN/Daily Mail:

    The CNN/Daily Mail raw datasets are downloaded from [1]. The stories are
    stored in different files; the summary appears at the end of the story as
    sentences that are prefixed by the special `@highlight` line. To process
    the data, untar both datasets in the same folder, and pass the path to this
    folder as the `data_dir` argument. The formatting code was inspired by [2].

    [1] https://cs.nyu.edu/~kcho/
    [2] https://github.com/abisee/cnn-dailymail/
    """

    def __init__(self, tokenizer, prefix="train", data_dir=""):
        assert os.path.isdir(data_dir)
        self.tokenizer = tokenizer

        # We initialize the class by listing all the files that contain
        # stories and summaries. Files are not read in memory given
        # the size of the corpus.
        self.stories_path = []
        datasets = ("cnn", "dailymail")
        for dataset in datasets:
            path_to_stories = os.path.join(data_dir, dataset, "stories")
            story_filenames_list = os.listdir(path_to_stories)
            for story_filename in story_filenames_list:
                path_to_story = os.path.join(path_to_stories, story_filename)
                if not os.path.isfile(path_to_story):
                    continue
                self.stories_path.append(path_to_story)

    def __len__(self):
        return len(self.stories_path)

    def __getitem__(self, idx):
        story_path = self.stories_path[idx]
        with open(story_path, encoding="utf-8") as source:
            raw_story = source.read()
            story_lines, summary_lines = process_story(raw_story)
        return story_lines, summary_lines
|
||||
|
||||
|
||||
def process_story(raw_story):
    """ Extract the story and summary from a story file.

    Attributes:
        raw_story (str): content of the story file as a UTF-8 encoded string.

    Returns:
        A tuple `(story_lines, summary_lines)`. If the story is empty or
        contains no highlights, `summary_lines` is an empty list.
    """
    nonempty_lines = list(
        filter(lambda x: len(x) != 0, [line.strip() for line in raw_story.split("\n")])
    )

    # for some unknown reason some lines miss a period, add it
    nonempty_lines = [_add_missing_period(line) for line in nonempty_lines]

    # gather article lines
    story_lines = []
    lines = deque(nonempty_lines)
    while True:
        try:
            element = lines.popleft()
            if element.startswith("@highlight"):
                break
            story_lines.append(element)
        except IndexError:
            # if "@highlight" is absent from the file we pop
            # all elements until there are none left.
            return story_lines, []

    # gather summary lines
    summary_lines = list(filter(lambda t: not t.startswith("@highlight"), lines))

    return story_lines, summary_lines
|
||||
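The story files place the article first and the summary sentences after `@highlight` marker lines. The following simplified sketch shows that split on a small in-memory story; `split_story` is a hypothetical helper that skips the missing-period repair and uses a flag instead of a `deque`:

```python
def split_story(raw_story):
    """Split a raw story into article lines and summary (highlight) lines."""
    lines = [line.strip() for line in raw_story.split("\n") if line.strip()]
    story_lines, summary_lines = [], []
    seen_highlight = False
    for line in lines:
        if line.startswith("@highlight"):
            # Marker lines only signal that summary sentences follow; they are dropped.
            seen_highlight = True
        elif seen_highlight:
            summary_lines.append(line)
        else:
            story_lines.append(line)
    return story_lines, summary_lines


raw = (
    "First sentence.\n\n"
    "Second sentence.\n\n"
    "@highlight\n\n"
    "Short summary.\n\n"
    "@highlight\n\n"
    "Another point."
)
story_lines, summary_lines = split_story(raw)
empty_story, empty_summary = split_story("")
```

As in the tests for `process_story`, an empty input or an input without any `@highlight` line yields an empty summary list rather than an error.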
|
||||
|
||||
def _add_missing_period(line):
    # u"\u2019" and u"\u201d" are the right single and right double quotation marks
    END_TOKENS = [".", "!", "?", "...", "'", "`", '"', u"\u2019", u"\u201d", ")"]
    if line.startswith("@highlight"):
        return line
    if line[-1] in END_TOKENS:
        return line
    return line + "."
|
||||
|
||||
|
||||
# --------------------------
# Encoding and preprocessing
# --------------------------
|
||||
|
||||
|
||||
def fit_to_block_size(sequence, block_size, pad_token):
    """ Adapt the source and target sequences' lengths to the block size.
    If the sequence is shorter than the block size we pad it with `pad_token`
    ids, which correspond to padding tokens.
    """
    if len(sequence) > block_size:
        return sequence[:block_size]
    else:
        sequence.extend([pad_token] * (block_size - len(sequence)))
        return sequence
|
||||
|
||||
|
||||
def build_lm_labels(sequence, pad_token):
    """ Padding tokens, identified by `pad_token`, are replaced by the value -1
    so they are not taken into account in the loss computation. """
    padded = sequence.clone()
    padded[padded == pad_token] = -1
    return padded
|
||||
|
||||
|
||||
def build_mask(sequence, pad_token):
    """ Builds the mask. The attention mechanism will only attend to positions
    with value 1. """
    mask = torch.ones_like(sequence)
    idx_pad_tokens = sequence == pad_token
    mask[idx_pad_tokens] = 0
    return mask
|
||||
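The three helpers above (`fit_to_block_size`, `build_lm_labels`, `build_mask`) form a small pipeline: fix the length, mark padding positions as ignored in the loss, and mask them out of attention. A minimal sketch of the same pipeline on plain Python lists, without `torch` (the function names below are illustrative, not the ones in `utils_summarization.py`):

```python
def pad_or_truncate(ids, block_size, pad_token):
    """Return a new list of exactly `block_size` ids (the input list is not modified)."""
    return ids[:block_size] + [pad_token] * max(0, block_size - len(ids))


def lm_labels(ids, pad_token):
    """Replace padding ids by -1 so a loss can ignore those positions."""
    return [-1 if token == pad_token else token for token in ids]


def attention_mask(ids, pad_token):
    """1 for real tokens, 0 for padding tokens."""
    return [0 if token == pad_token else 1 for token in ids]


source = [5, 6, 7]
ids = pad_or_truncate(source, block_size=6, pad_token=0)
labels = lm_labels(ids, pad_token=0)
mask = attention_mask(ids, pad_token=0)
print(ids, labels, mask)
```

One design difference worth noting: `fit_to_block_size` pads by calling `extend` on the list it receives, so the caller's list is modified in place, whereas this sketch returns a new list and leaves `source` untouched.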
|
||||
|
||||
def encode_for_summarization(story_lines, summary_lines, tokenizer):
    """ Encode the story and summary lines, and join them
    as specified in [1] by using `[SEP] [CLS]` tokens to separate
    sentences.
    """
    story_lines_token_ids = [
        tokenizer.add_special_tokens_single_sequence(tokenizer.encode(line))
        for line in story_lines
    ]
    summary_lines_token_ids = [
        tokenizer.add_special_tokens_single_sequence(tokenizer.encode(line))
        for line in summary_lines
    ]

    story_token_ids = [
        token for sentence in story_lines_token_ids for token in sentence
    ]
    summary_token_ids = [
        token for sentence in summary_lines_token_ids for token in sentence
    ]

    return story_token_ids, summary_token_ids
|
||||
|
||||
|
||||
def compute_token_type_ids(batch, separator_token_id):
    """ Segment embeddings as described in [1]

    The values {0,1} were found in the repository [2].

    Attributes:
        batch: torch.Tensor, size [batch_size, block_size]
            Batch of input.
        separator_token_id: int
            The value of the token that separates the segments.

    [1] Liu, Yang, and Mirella Lapata. "Text summarization with pretrained encoders."
        arXiv preprint arXiv:1908.08345 (2019).
    [2] https://github.com/nlpyang/PreSumm (/src/prepro/data_builder.py, commit fac1217)
    """
    batch_embeddings = []
    for sequence in batch:
        sentence_num = 0
        embeddings = []
        for s in sequence:
            if s == separator_token_id:
                sentence_num += 1
            embeddings.append(sentence_num % 2)
        batch_embeddings.append(embeddings)
    return torch.tensor(batch_embeddings)
|
||||
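The segment ids computed above alternate between 0 and 1 each time the separator token is encountered, and the separator itself already belongs to the new segment. A minimal sketch on a plain Python list (`alternating_segment_ids` is an illustrative name; the inputs are the ones used in the accompanying unit test):

```python
def alternating_segment_ids(sequence, separator_token_id):
    """Assign 0/1 segment ids that flip at every separator token."""
    segment_ids = []
    sentence_num = 0
    for token in sequence:
        if token == separator_token_id:
            # The separator opens the next sentence, so the counter moves first.
            sentence_num += 1
        segment_ids.append(sentence_num % 2)
    return segment_ids


separator = 101
no_separator = alternating_segment_ids([1, 2, 3, 4, 5, 6], separator)
one_separator = alternating_segment_ids([1, 2, 3, 101, 5, 6], separator)
two_separators = alternating_segment_ids([1, 101, 3, 4, 101, 6], separator)
print(no_separator, one_separator, two_separators)
```

Because only the parity of the sentence counter is kept, the third sentence of a sequence gets the same segment id as the first.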
136
examples/utils_summarization_test.py
Normal file
@@ -0,0 +1,136 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2019 HuggingFace Inc.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
|
||||
from utils_summarization import (
|
||||
compute_token_type_ids,
|
||||
fit_to_block_size,
|
||||
build_mask,
|
||||
build_lm_labels,
|
||||
process_story,
|
||||
)
|
||||
|
||||
|
||||
class SummarizationDataProcessingTest(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.block_size = 10
|
||||
|
||||
def test_fit_to_block_sequence_too_small(self):
|
||||
""" Pad the sequence with 0 if the sequence is smaller than the block size."""
|
||||
sequence = [1, 2, 3, 4]
|
||||
expected_output = [1, 2, 3, 4, 0, 0, 0, 0, 0, 0]
|
||||
self.assertEqual(
|
||||
fit_to_block_size(sequence, self.block_size, 0), expected_output
|
||||
)
|
||||
|
||||
def test_fit_to_block_sequence_fit_exactly(self):
|
||||
""" Do nothing if the sequence is the right size. """
|
||||
sequence = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
|
||||
expected_output = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
|
||||
self.assertEqual(
|
||||
fit_to_block_size(sequence, self.block_size, 0), expected_output
|
||||
)
|
||||
|
||||
def test_fit_to_block_sequence_too_big(self):
|
||||
""" Truncate the sequence if it is too long. """
|
||||
sequence = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
|
||||
expected_output = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
|
||||
self.assertEqual(
|
||||
fit_to_block_size(sequence, self.block_size, 0), expected_output
|
||||
)
|
||||
|
||||
def test_process_story_no_highlights(self):
|
||||
""" Processing a story with no highlights returns an empty list for the summary.
|
||||
"""
|
||||
raw_story = """It was the year of Our Lord one thousand seven hundred and
|
||||
seventy-five.\n\nSpiritual revelations were conceded to England at that
|
||||
favoured period, as at this."""
|
||||
_, summary_lines = process_story(raw_story)
|
||||
self.assertEqual(summary_lines, [])
|
||||
|
||||
def test_process_empty_story(self):
|
||||
""" An empty story returns an empty collection of lines.
|
||||
"""
|
||||
raw_story = ""
|
||||
story_lines, summary_lines = process_story(raw_story)
|
||||
self.assertEqual(story_lines, [])
|
||||
self.assertEqual(summary_lines, [])
|
||||
|
||||
def test_process_story_with_missing_period(self):
|
||||
raw_story = (
|
||||
"It was the year of Our Lord one thousand seven hundred and "
|
||||
"seventy-five\n\nSpiritual revelations were conceded to England "
|
||||
"at that favoured period, as at this.\n@highlight\n\nIt was the best of times"
|
||||
)
|
||||
story_lines, summary_lines = process_story(raw_story)
|
||||
|
||||
expected_story_lines = [
|
||||
"It was the year of Our Lord one thousand seven hundred and seventy-five.",
|
||||
"Spiritual revelations were conceded to England at that favoured period, as at this.",
|
||||
]
|
||||
self.assertEqual(expected_story_lines, story_lines)
|
||||
|
||||
expected_summary_lines = ["It was the best of times."]
|
||||
self.assertEqual(expected_summary_lines, summary_lines)
|
||||
|
||||
def test_build_lm_labels_no_padding(self):
|
||||
sequence = torch.tensor([1, 2, 3, 4])
|
||||
expected = sequence
|
||||
np.testing.assert_array_equal(
|
||||
build_lm_labels(sequence, 0).numpy(), expected.numpy()
|
||||
)
|
||||
|
||||
def test_build_lm_labels(self):
|
||||
sequence = torch.tensor([1, 2, 3, 4, 0, 0, 0])
|
||||
expected = torch.tensor([1, 2, 3, 4, -1, -1, -1])
|
||||
np.testing.assert_array_equal(
|
||||
build_lm_labels(sequence, 0).numpy(), expected.numpy()
|
||||
)
|
||||
|
||||
def test_build_mask_no_padding(self):
|
||||
sequence = torch.tensor([1, 2, 3, 4])
|
||||
expected = torch.tensor([1, 1, 1, 1])
|
||||
np.testing.assert_array_equal(build_mask(sequence, 0).numpy(), expected.numpy())
|
||||
|
||||
def test_build_mask(self):
|
||||
sequence = torch.tensor([1, 2, 3, 4, 23, 23, 23])
|
||||
expected = torch.tensor([1, 1, 1, 1, 0, 0, 0])
|
||||
np.testing.assert_array_equal(
|
||||
build_mask(sequence, 23).numpy(), expected.numpy()
|
||||
)
|
||||
|
||||
def test_build_mask_with_padding_equal_to_one(self):
|
||||
sequence = torch.tensor([8, 2, 3, 4, 1, 1, 1])
|
||||
expected = torch.tensor([1, 1, 1, 1, 0, 0, 0])
|
||||
np.testing.assert_array_equal(build_mask(sequence, 1).numpy(), expected.numpy())
|
||||
|
||||
def test_compute_token_type_ids(self):
|
||||
separator = 101
|
||||
batch = torch.tensor(
|
||||
[[1, 2, 3, 4, 5, 6], [1, 2, 3, 101, 5, 6], [1, 101, 3, 4, 101, 6]]
|
||||
)
|
||||
expected = torch.tensor(
|
||||
[[0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1], [0, 1, 1, 1, 0, 0]]
|
||||
)
|
||||
|
||||
result = compute_token_type_ids(batch, separator)
|
||||
np.testing.assert_array_equal(result, expected)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||