From c55badcee0c702f184aee2c85a0146c8804cc141 Mon Sep 17 00:00:00 2001
From: thomwolf
Date: Tue, 15 Oct 2019 09:33:52 +0200
Subject: [PATCH] Add NER finetuning details by @stefan-it in example readme

---
 examples/README.md | 103 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/examples/README.md b/examples/README.md
index 382d794fcb..806601f9f3 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -8,8 +8,9 @@ similar API between the different models.
 | [Language Model fine-tuning](#language-model-fine-tuning) | Fine-tuning the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
 | [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
 | [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
-| [SQuAD](#squad) | Using BERT for question answering, examples with distributed training. |
+| [SQuAD](#squad) | Using BERT/XLM/XLNet/RoBERTa for question answering, examples with distributed training. |
 | [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
+| [Named Entity Recognition](#named-entity-recognition) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
 
 ## Language model fine-tuning
 
@@ -390,3 +391,103 @@ exact_match = 86.91
 
 This fine-tuneds model is available as a checkpoint under the reference
 `bert-large-uncased-whole-word-masking-finetuned-squad`.
 
+## Named Entity Recognition
+
+Based on the script [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py).
+This example fine-tunes multilingual BERT on GermEval 2014 (German NER).
+Details and results for the fine-tuning provided by @stefan-it.
+
+### Data (Download and pre-processing steps)
+
+Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.
+
+Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:
+
+```bash
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
+```
+
+The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`. One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s. I wrote a script that a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
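The filtering and splitting described above can be sketched roughly as follows. This is an illustrative sketch only, not the contents of `preprocess.py`: `filter_and_split` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is a stand-in for a real subword tokenizer such as `BertTokenizer.tokenize`, chosen so that the example runs without downloading a model.

```python
import unicodedata
from typing import Callable, Iterable, Iterator, List


def toy_tokenize(token: str) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer (NOT BertTokenizer)."""
    # Drop control/format characters (Unicode categories starting with "C").
    cleaned = "".join(
        ch for ch in token if not unicodedata.category(ch).startswith("C")
    )
    # Cut the remainder into pieces of at most 3 characters to mimic subwords.
    return [cleaned[i:i + 3] for i in range(0, len(cleaned), 3)]


def filter_and_split(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
    max_len: int,
) -> Iterator[str]:
    """Yield "token label" lines, dropping tokens that tokenize to nothing
    and starting a new sentence once the subtoken budget would be exceeded."""
    subword_count = 0
    for line in lines:
        line = line.rstrip()
        if not line:
            # A blank line is an existing sentence boundary.
            subword_count = 0
            yield ""
            continue
        token = line.split()[0]
        n_subwords = len(tokenize(token))
        if n_subwords == 0:
            # The tokenizer produced nothing for this token: skip the line.
            continue
        if subword_count and subword_count + n_subwords > max_len:
            # Budget exceeded: insert a sentence boundary before this token.
            yield ""
            subword_count = 0
        subword_count += n_subwords
        yield line


sample = ["Berlin B-LOC", "\x96 O", "ist O", "schön O", "", "Hallo O"]
result = list(filter_and_split(sample, toy_tokenize, max_len=3))
print(result)
```

With a real tokenizer, one would pass its `tokenize` method in place of `toy_tokenize` and use the maximum sequence length chosen for training as `max_len`.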
+
+```bash
+wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
+```
+Let's define some variables that we need for further pre-processing steps and training the model:
+
+```bash
+export MAX_LENGTH=128
+export BERT_MODEL=bert-base-multilingual-cased
+```
+
+Run the pre-processing script on training, dev and test datasets:
+
+```bash
+python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
+python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
+python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
+```
+
+The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so a custom set of labels must be used:
+
+```bash
+cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
+```
+
+### Training
+
+Additional environment variables must be set:
+
+```bash
+export OUTPUT_DIR=germeval-model
+export BATCH_SIZE=32
+export NUM_EPOCHS=3
+export SAVE_STEPS=750
+export SEED=1
+```
+
+To start training, just run:
+
+```bash
+python3 run_ner.py --data_dir ./ \
+--model_type bert \
+--labels ./labels.txt \
+--model_name_or_path $BERT_MODEL \
+--output_dir $OUTPUT_DIR \
+--max_seq_length $MAX_LENGTH \
+--num_train_epochs $NUM_EPOCHS \
+--per_gpu_train_batch_size $BATCH_SIZE \
+--save_steps $SAVE_STEPS \
+--seed $SEED \
+--do_train \
+--do_eval \
+--do_predict
+```
+
+If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be evaluated on both the development and test datasets.
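The `labels.txt` file passed to `--labels` is just the set of distinct tags found in the second column of the pre-processed files. As a rough Python counterpart to the `cut | grep | sort | uniq` pipeline, one could write something like the sketch below. `collect_labels` is a hypothetical helper, not part of `run_ner.py`; it skips lines without a second column, and Python's `sorted` may order entries differently from `sort` under some locales.

```python
from typing import Iterable, List


def collect_labels(lines: Iterable[str]) -> List[str]:
    """Return the sorted set of labels from space-separated "token label" lines."""
    labels = set()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        # Skip sentence separators (blank lines) and lines without a label.
        if len(parts) < 2 or not parts[1]:
            continue
        labels.add(parts[1])
    return sorted(labels)


example_lines = ["Berlin B-LOC", "ist O", "", "Merkel B-PER", "ist O"]
print("\n".join(collect_labels(example_lines)))
```

To rebuild the real file, the same function could be fed the lines of `train.txt`, `dev.txt` and `test.txt` in turn and the result written to `labels.txt`.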
+
+### Evaluation
+
+Evaluation on the development dataset outputs the following for our example:
+
+```bash
+10/04/2019 00:42:06 - INFO - __main__ - ***** Eval results *****
+10/04/2019 00:42:06 - INFO - __main__ - f1 = 0.8623348017621146
+10/04/2019 00:42:06 - INFO - __main__ - loss = 0.07183869666975543
+10/04/2019 00:42:06 - INFO - __main__ - precision = 0.8467916366258111
+10/04/2019 00:42:06 - INFO - __main__ - recall = 0.8784592370979806
+```
+
+On the test dataset, the following results were achieved:
+
+```bash
+10/04/2019 00:42:42 - INFO - __main__ - ***** Eval results *****
+10/04/2019 00:42:42 - INFO - __main__ - f1 = 0.8614389652384803
+10/04/2019 00:42:42 - INFO - __main__ - loss = 0.07064602487454782
+10/04/2019 00:42:42 - INFO - __main__ - precision = 0.8604651162790697
+10/04/2019 00:42:42 - INFO - __main__ - recall = 0.8624150210424085
+```
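As a sanity check on the logs above, the reported `f1` is the harmonic mean of the reported `precision` and `recall`. The snippet below recomputes it from the logged numbers. It is a minimal sketch: `f1_score` here is a local helper defined for illustration, not a function taken from `run_ner.py`.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall copied from the evaluation logs above.
dev_f1 = f1_score(0.8467916366258111, 0.8784592370979806)
test_f1 = f1_score(0.8604651162790697, 0.8624150210424085)

print(f"dev  f1 = {dev_f1:.4f}")
print(f"test f1 = {test_f1:.4f}")
```

Both recomputed values should agree with the logged `f1` lines up to floating-point rounding.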