Reorganize examples (#9010)
* Reorganize example folder * Continue reorganization * Change requirements for tests * Final cleanup * Finish regroup with tests all passing * Copyright * Requirements and readme * Make a full link for the documentation * Address review comments * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Add symlink * Reorg again * Apply suggestions from code review Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com> * Adapt title * Update to new strucutre * Remove test * Update READMEs Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
This commit is contained in:
@@ -1,3 +1,19 @@
|
||||
<!---
|
||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
-->
|
||||
|
||||
## Token classification
|
||||
|
||||
Fine-tuning the library models for token classification task such as Named Entity Recognition (NER) or Parts-of-speech
|
||||
@@ -32,10 +48,16 @@ python run_ner.py \
|
||||
--do_eval
|
||||
```
|
||||
|
||||
**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library) as it
|
||||
uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
|
||||
[this table](https://huggingface.co/transformers/index.html#bigtable), if it doesn't you can still use the old version
|
||||
of the script.
|
||||
|
||||
## Old version of the script
|
||||
|
||||
Based on the scripts [`run_ner_old.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py) for Pytorch and
|
||||
[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_tf_ner.py) for Tensorflow 2.
|
||||
You can find the old version of the PyTorch script [here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/token-classification/run_ner_old.py).
|
||||
|
||||
### TensorFlow version
|
||||
|
||||
The following examples are covered in this section:
|
||||
|
||||
@@ -98,79 +120,6 @@ export SAVE_STEPS=750
|
||||
export SEED=1
|
||||
```
|
||||
|
||||
#### Run the Pytorch version
|
||||
|
||||
To start training, just run:
|
||||
|
||||
```bash
|
||||
python3 run_ner_old.py --data_dir ./ \
|
||||
--labels ./labels.txt \
|
||||
--model_name_or_path $BERT_MODEL \
|
||||
--output_dir $OUTPUT_DIR \
|
||||
--max_seq_length $MAX_LENGTH \
|
||||
--num_train_epochs $NUM_EPOCHS \
|
||||
--per_device_train_batch_size $BATCH_SIZE \
|
||||
--save_steps $SAVE_STEPS \
|
||||
--seed $SEED \
|
||||
--do_train \
|
||||
--do_eval \
|
||||
--do_predict
|
||||
```
|
||||
|
||||
If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be both evaluated on development and test datasets.
|
||||
|
||||
#### JSON-based configuration file
|
||||
|
||||
Instead of passing all parameters via commandline arguments, the `run_ner_old.py` script also supports reading parameters from a json-based configuration file:
|
||||
|
||||
```json
|
||||
{
|
||||
"data_dir": ".",
|
||||
"labels": "./labels.txt",
|
||||
"model_name_or_path": "bert-base-multilingual-cased",
|
||||
"output_dir": "germeval-model",
|
||||
"max_seq_length": 128,
|
||||
"num_train_epochs": 3,
|
||||
"per_device_train_batch_size": 32,
|
||||
"save_steps": 750,
|
||||
"seed": 1,
|
||||
"do_train": true,
|
||||
"do_eval": true,
|
||||
"do_predict": true
|
||||
}
|
||||
```
|
||||
|
||||
It must be saved with a `.json` extension and can be used by running `python3 run_ner_old.py config.json`.
|
||||
|
||||
#### Evaluation
|
||||
|
||||
Evaluation on development dataset outputs the following for our example:
|
||||
|
||||
```bash
|
||||
10/04/2019 00:42:06 - INFO - __main__ - ***** Eval results *****
|
||||
10/04/2019 00:42:06 - INFO - __main__ - f1 = 0.8623348017621146
|
||||
10/04/2019 00:42:06 - INFO - __main__ - loss = 0.07183869666975543
|
||||
10/04/2019 00:42:06 - INFO - __main__ - precision = 0.8467916366258111
|
||||
10/04/2019 00:42:06 - INFO - __main__ - recall = 0.8784592370979806
|
||||
```
|
||||
|
||||
On the test dataset the following results could be achieved:
|
||||
|
||||
```bash
|
||||
10/04/2019 00:42:42 - INFO - __main__ - ***** Eval results *****
|
||||
10/04/2019 00:42:42 - INFO - __main__ - f1 = 0.8614389652384803
|
||||
10/04/2019 00:42:42 - INFO - __main__ - loss = 0.07064602487454782
|
||||
10/04/2019 00:42:42 - INFO - __main__ - precision = 0.8604651162790697
|
||||
10/04/2019 00:42:42 - INFO - __main__ - recall = 0.8624150210424085
|
||||
```
|
||||
|
||||
#### Run PyTorch version using PyTorch-Lightning
|
||||
|
||||
Run `bash run_pl.sh` from the `ner` directory. This would also install `pytorch-lightning` and the `examples/requirements.txt`. It is a shell pipeline which would automatically download, pre-process the data and run the models in `germeval-model` directory. Logs are saved in `lightning_logs` directory.
|
||||
|
||||
Pass `--gpus` flag to change the number of GPUs. Default uses 1. At the end, the expected results are: `TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}`
|
||||
|
||||
|
||||
#### Run the Tensorflow 2 version
|
||||
|
||||
To start training, just run:
|
||||
@@ -235,102 +184,3 @@ On the test dataset the following results could be achieved:
|
||||
micro avg 0.8722 0.8774 0.8748 13869
|
||||
macro avg 0.8712 0.8774 0.8740 13869
|
||||
```
|
||||
|
||||
### Emerging and Rare Entities task: WNUT’17 (English NER) dataset
|
||||
|
||||
Description of the WNUT’17 task from the [shared task website](http://noisy-text.github.io/2017/index.html):
|
||||
|
||||
> The WNUT’17 shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions.
|
||||
> Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on
|
||||
> them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms.
|
||||
|
||||
Six labels are available in the dataset. An overview can be found on this [page](http://noisy-text.github.io/2017/files/).
|
||||
|
||||
#### Data (Download and pre-processing steps)
|
||||
|
||||
The dataset can be downloaded from the [official GitHub](https://github.com/leondz/emerging_entities_17) repository.
|
||||
|
||||
The following commands show how to prepare the dataset for fine-tuning:
|
||||
|
||||
```bash
|
||||
mkdir -p data_wnut_17
|
||||
|
||||
curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/wnut17train.conll' | tr '\t' ' ' > data_wnut_17/train.txt.tmp
|
||||
curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/emerging.dev.conll' | tr '\t' ' ' > data_wnut_17/dev.txt.tmp
|
||||
curl -L 'https://raw.githubusercontent.com/leondz/emerging_entities_17/master/emerging.test.annotated' | tr '\t' ' ' > data_wnut_17/test.txt.tmp
|
||||
```
|
||||
|
||||
Let's define some variables that we need for further pre-processing steps:
|
||||
|
||||
```bash
|
||||
export MAX_LENGTH=128
|
||||
export BERT_MODEL=bert-large-cased
|
||||
```
|
||||
|
||||
Here we use the English BERT large model for fine-tuning.
|
||||
The `preprocess.py` scripts splits longer sentences into smaller ones (once the max. subtoken length is reached):
|
||||
|
||||
```bash
|
||||
python3 scripts/preprocess.py data_wnut_17/train.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/train.txt
|
||||
python3 scripts/preprocess.py data_wnut_17/dev.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/dev.txt
|
||||
python3 scripts/preprocess.py data_wnut_17/test.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/test.txt
|
||||
```
|
||||
|
||||
In the last pre-processing step, the `labels.txt` file needs to be generated. This file contains all available labels:
|
||||
|
||||
```bash
|
||||
cat data_wnut_17/train.txt data_wnut_17/dev.txt data_wnut_17/test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > data_wnut_17/labels.txt
|
||||
```
|
||||
|
||||
#### Run the Pytorch version
|
||||
|
||||
Fine-tuning with the PyTorch version can be started using the `run_ner_old.py` script. In this example we use a JSON-based configuration file.
|
||||
|
||||
This configuration file looks like:
|
||||
|
||||
```json
|
||||
{
|
||||
"data_dir": "./data_wnut_17",
|
||||
"labels": "./data_wnut_17/labels.txt",
|
||||
"model_name_or_path": "bert-large-cased",
|
||||
"output_dir": "wnut-17-model-1",
|
||||
"max_seq_length": 128,
|
||||
"num_train_epochs": 3,
|
||||
"per_device_train_batch_size": 32,
|
||||
"save_steps": 425,
|
||||
"seed": 1,
|
||||
"do_train": true,
|
||||
"do_eval": true,
|
||||
"do_predict": true,
|
||||
"fp16": false
|
||||
}
|
||||
```
|
||||
|
||||
If your GPU supports half-precision training, please set `fp16` to `true`.
|
||||
|
||||
Save this JSON-based configuration under `wnut_17.json`. The fine-tuning can be started with `python3 run_ner_old.py wnut_17.json`.
|
||||
|
||||
#### Evaluation
|
||||
|
||||
Evaluation on development dataset outputs the following:
|
||||
|
||||
```bash
|
||||
05/29/2020 23:33:44 - INFO - __main__ - ***** Eval results *****
|
||||
05/29/2020 23:33:44 - INFO - __main__ - eval_loss = 0.26505235286212275
|
||||
05/29/2020 23:33:44 - INFO - __main__ - eval_precision = 0.7008264462809918
|
||||
05/29/2020 23:33:44 - INFO - __main__ - eval_recall = 0.507177033492823
|
||||
05/29/2020 23:33:44 - INFO - __main__ - eval_f1 = 0.5884802220680084
|
||||
05/29/2020 23:33:44 - INFO - __main__ - epoch = 3.0
|
||||
```
|
||||
|
||||
On the test dataset the following results could be achieved:
|
||||
|
||||
```bash
|
||||
05/29/2020 23:33:44 - INFO - transformers.trainer - ***** Running Prediction *****
|
||||
05/29/2020 23:34:02 - INFO - __main__ - eval_loss = 0.30948806500973547
|
||||
05/29/2020 23:34:02 - INFO - __main__ - eval_precision = 0.5840108401084011
|
||||
05/29/2020 23:34:02 - INFO - __main__ - eval_recall = 0.3994439295644115
|
||||
05/29/2020 23:34:02 - INFO - __main__ - eval_f1 = 0.47440836543753434
|
||||
```
|
||||
|
||||
WNUT’17 is a very difficult task. Current state-of-the-art results on this dataset can be found [here](http://nlpprogress.com/english/named_entity_recognition.html).
|
||||
|
||||
Reference in New Issue
Block a user