Reorganize examples (#9010)
* Reorganize example folder
* Continue reorganization
* Change requirements for tests
* Final cleanup
* Finish regroup with tests all passing
* Copyright
* Requirements and readme
* Make a full link for the documentation
* Address review comments
* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Add symlink
* Reorg again
* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Adapt title
* Update to new structure
* Remove test
* Update READMEs

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
@@ -309,7 +309,7 @@ jobs:
|
|||||||
- v0.4-{{ checksum "setup.py" }}
|
- v0.4-{{ checksum "setup.py" }}
|
||||||
- run: pip install --upgrade pip
|
- run: pip install --upgrade pip
|
||||||
- run: pip install .[sklearn,torch,sentencepiece,testing]
|
- run: pip install .[sklearn,torch,sentencepiece,testing]
|
||||||
- run: pip install -r examples/requirements.txt
|
- run: pip install -r examples/_tests_requirements.txt
|
||||||
- save_cache:
|
- save_cache:
|
||||||
key: v0.4-torch_examples-{{ checksum "setup.py" }}
|
key: v0.4-torch_examples-{{ checksum "setup.py" }}
|
||||||
paths:
|
paths:
|
||||||
|
|||||||
@@ -16,59 +16,58 @@ limitations under the License.
|
|||||||
|
|
||||||
# Examples
|
# Examples
|
||||||
|
|
||||||
Version 2.9 of 🤗 Transformers introduced a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
|
This folder contains actively maintained examples of use of 🤗 Transformers organized along NLP tasks. If you are looking for an example that used to
|
||||||
Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.2+.
|
be in this folder, it may have moved to our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects).
|
||||||
|
|
||||||
Here is the list of all our examples:
|
|
||||||
- **grouped by task** (all official examples work for multiple models)
|
|
||||||
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might
|
|
||||||
just lack some features),
|
|
||||||
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
|
|
||||||
- links to **Colab notebooks** to walk through the scripts and run them easily,
|
|
||||||
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
|
|
||||||
|
|
||||||
|
|
||||||
## Important note
|
## Important note
|
||||||
|
|
||||||
**Important**
|
**Important**
|
||||||
|
|
||||||
To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements.
|
To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
|
||||||
Execute the following steps in a new virtual environment:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git clone https://github.com/huggingface/transformers
|
git clone https://github.com/huggingface/transformers
|
||||||
cd transformers
|
cd transformers
|
||||||
pip install .
|
pip install .
|
||||||
pip install -r ./examples/requirements.txt
|
```
|
||||||
|
Then `cd` into the example folder of your choice and run:
|
||||||
|
```bash
|
||||||
|
pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
Alternatively, you can run the version of the examples as they were for your current version of Transformers via (for instance with v3.4.0):
|
Alternatively, you can run the version of the examples as they were for your current version of Transformers via (for instance with v3.5.1):
|
||||||
```bash
|
```bash
|
||||||
git checkout tags/v3.4.0
|
git checkout tags/v3.5.1
|
||||||
```
|
```
|
||||||
|
|
||||||
## The Big Table of Tasks
|
## The Big Table of Tasks
|
||||||
|
|
||||||
|
Here is the list of all our examples:
|
||||||
|
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work; they might
|
||||||
|
just lack some features),
|
||||||
|
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
|
||||||
|
- links to **Colab notebooks** to walk through the scripts and run them easily,
|
||||||
|
<!--
|
||||||
|
Coming soon!
|
||||||
|
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
|
||||||
|
-->
|
||||||
|
|
||||||
| Task | Example datasets | Trainer support | TFTrainer support | 🤗 Datasets | Colab
|
| Task | Example datasets | Trainer support | TFTrainer support | 🤗 Datasets | Colab
|
||||||
|---|---|:---:|:---:|:---:|:---:|
|
|---|---|:---:|:---:|:---:|:---:|
|
||||||
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
|
||||||
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
|
|
||||||
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
|
|
||||||
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
|
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
|
||||||
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
|
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
|
||||||
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
|
|
||||||
| [**`distillation`**](https://github.com/huggingface/transformers/tree/master/examples/distillation) | All | - | - | - | -
|
|
||||||
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
|
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
|
||||||
|
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
|
||||||
|
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
|
||||||
|
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
|
||||||
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
|
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
|
||||||
| [**`bertology`**](https://github.com/huggingface/transformers/tree/master/examples/bertology) | - | - | - | - | -
|
|
||||||
| [**`adversarial`**](https://github.com/huggingface/transformers/tree/master/examples/adversarial) | HANS | ✅ | - | - | -
|
|
||||||
|
|
||||||
|
|
||||||
<br>
|
<!--
|
||||||
|
|
||||||
## One-click Deploy to Cloud (wip)
|
## One-click Deploy to Cloud (wip)
|
||||||
|
|
||||||
**Coming soon!**
|
**Coming soon!**
|
||||||
|
-->
|
||||||
|
|
||||||
## Running on TPUs
|
## Running on TPUs
|
||||||
|
|
||||||
|
|||||||
20
examples/_tests_requirements.txt
Normal file
@@ -0,0 +1,20 @@
|
|||||||
|
tensorboard
|
||||||
|
scikit-learn
|
||||||
|
seqeval
|
||||||
|
psutil
|
||||||
|
sacrebleu
|
||||||
|
rouge-score
|
||||||
|
tensorflow_datasets
|
||||||
|
matplotlib
|
||||||
|
git-python==1.0.3
|
||||||
|
faiss-cpu
|
||||||
|
streamlit
|
||||||
|
elasticsearch
|
||||||
|
nltk
|
||||||
|
pandas
|
||||||
|
datasets >= 1.1.3
|
||||||
|
fire
|
||||||
|
pytest
|
||||||
|
conllu
|
||||||
|
sentencepiece != 0.1.92
|
||||||
|
protobuf
|
||||||
@@ -1,3 +1,19 @@
|
|||||||
|
<!---
|
||||||
|
Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
# 🤗 Benchmark results
|
# 🤗 Benchmark results
|
||||||
|
|
||||||
Here, you can find a list of the different benchmark results created by the community.
|
Here, you can find a list of the different benchmark results created by the community.
|
||||||
|
|||||||
@@ -1,3 +1,17 @@
|
|||||||
|
# Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
import csv
|
import csv
|
||||||
from collections import defaultdict
|
from collections import defaultdict
|
||||||
from dataclasses import dataclass, field
|
from dataclasses import dataclass, field
|
||||||
|
|||||||
@@ -1,3 +1,17 @@
|
|||||||
|
# Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
# tests directory-specific settings - this file is run automatically
|
# tests directory-specific settings - this file is run automatically
|
||||||
# by pytest before any tests are run
|
# by pytest before any tests are run
|
||||||
|
|
||||||
|
|||||||
@@ -1,5 +0,0 @@
|
|||||||
# Community contributed examples
|
|
||||||
|
|
||||||
This folder contains examples which are not actively maintained (mostly contributed by the community).
|
|
||||||
|
|
||||||
Using these examples together with a recent version of the library usually requires to make small (sometimes big) adaptations to get the scripts working.
|
|
||||||
@@ -1,3 +1,19 @@
|
|||||||
|
<!---
|
||||||
|
Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
## Language model training
|
## Language model training
|
||||||
|
|
||||||
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2,
|
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2,
|
||||||
|
|||||||
3
examples/language-modeling/requirements.txt
Normal file
@@ -0,0 +1,3 @@
|
|||||||
|
datasets >= 1.1.3
|
||||||
|
sentencepiece != 0.1.92
|
||||||
|
protobuf
|
||||||
21
examples/legacy/README.md
Normal file
@@ -0,0 +1,21 @@
|
|||||||
|
<!---
|
||||||
|
Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# Legacy examples
|
||||||
|
|
||||||
|
This folder contains examples which are not actively maintained (mostly contributed by the community).
|
||||||
|
|
||||||
|
Using these examples together with a recent version of the library usually requires making small (sometimes big) adaptations to get the scripts working.
|
||||||
@@ -21,7 +21,7 @@ mkdir -p $OUTPUT_DIR
|
|||||||
# Add parent directory to python path to access lightning_base.py
|
# Add parent directory to python path to access lightning_base.py
|
||||||
export PYTHONPATH="../":"${PYTHONPATH}"
|
export PYTHONPATH="../":"${PYTHONPATH}"
|
||||||
|
|
||||||
python3 run_pl_glue.py --gpus 1 --data_dir $DATA_DIR \
|
python3 run_glue.py --gpus 1 --data_dir $DATA_DIR \
|
||||||
--task $TASK \
|
--task $TASK \
|
||||||
--model_name_or_path $BERT_MODEL \
|
--model_name_or_path $BERT_MODEL \
|
||||||
--output_dir $OUTPUT_DIR \
|
--output_dir $OUTPUT_DIR \
|
||||||
@@ -31,7 +31,7 @@ mkdir -p $OUTPUT_DIR
|
|||||||
# Add parent directory to python path to access lightning_base.py
|
# Add parent directory to python path to access lightning_base.py
|
||||||
export PYTHONPATH="../":"${PYTHONPATH}"
|
export PYTHONPATH="../":"${PYTHONPATH}"
|
||||||
|
|
||||||
python3 run_pl_ner.py --data_dir ./ \
|
python3 run_ner.py --data_dir ./ \
|
||||||
--labels ./labels.txt \
|
--labels ./labels.txt \
|
||||||
--model_name_or_path $BERT_MODEL \
|
--model_name_or_path $BERT_MODEL \
|
||||||
--output_dir $OUTPUT_DIR \
|
--output_dir $OUTPUT_DIR \
|
||||||
@@ -26,7 +26,7 @@ export SEED=1
|
|||||||
# Add parent directory to python path to access lightning_base.py
|
# Add parent directory to python path to access lightning_base.py
|
||||||
export PYTHONPATH="../":"${PYTHONPATH}"
|
export PYTHONPATH="../":"${PYTHONPATH}"
|
||||||
|
|
||||||
python3 run_pl_ner.py --data_dir ./ \
|
python3 run_ner.py --data_dir ./ \
|
||||||
--task_type POS \
|
--task_type POS \
|
||||||
--model_name_or_path $BERT_MODEL \
|
--model_name_or_path $BERT_MODEL \
|
||||||
--output_dir $OUTPUT_DIR \
|
--output_dir $OUTPUT_DIR \
|
||||||
229
examples/legacy/token-classification/README.md
Normal file
@@ -0,0 +1,229 @@
|
|||||||
|
## Token classification
|
||||||
|
|
||||||
|
Based on the script [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/token-classification/run_ner.py).
|
||||||
|
|
||||||
|
The following examples are covered in this section:
|
||||||
|
|
||||||
|
* NER on the GermEval 2014 (German NER) dataset
|
||||||
|
* Emerging and Rare Entities task: WNUT’17 (English NER) dataset
|
||||||
|
|
||||||
|
Details and results for the fine-tuning were provided by @stefan-it.
|
||||||
|
|
||||||
|
### GermEval 2014 (German NER) dataset
|
||||||
|
|
||||||
|
#### Data (Download and pre-processing steps)
|
||||||
|
|
||||||
|
Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.
|
||||||
|
|
||||||
|
Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns. The pre-processing step extracts only the two relevant columns (token and outer span NER annotation):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
|
||||||
|
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
|
||||||
|
curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
|
||||||
|
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
|
||||||
|
curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
|
||||||
|
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
|
||||||
|
```
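For readers who prefer to follow the pipeline in Python, the sketch below mirrors what `grep -v "^#" | cut -f 2,3 | tr '\t' ' '` does to each line. The function name `extract_token_and_label` and the sample rows are illustrative assumptions, not part of the repository, and the sample is made up rather than taken from the GermEval files.

```python
def extract_token_and_label(lines):
    # Illustrative stand-in for the shell pipeline above; not repository code.
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # grep -v "^#": drop comment lines
        if "\t" not in line:
            out.append(line)  # cut passes lines without a delimiter through (blank separators)
            continue
        fields = line.split("\t")
        out.append(" ".join(fields[1:3]))  # cut -f 2,3, then tr tab -> space
    return out


# Made-up sample in the four-column, tab-separated layout described above.
sample = [
    "#\tsome source comment\n",
    "1\tSchartau\tB-PER\tO\n",
    "2\tsagte\tO\tO\n",
    "\n",
]
converted = extract_token_and_label(sample)
print(converted)
```

Blank lines are kept because they separate sentences in the resulting file.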
|
||||||
|
|
||||||
|
The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`.
|
||||||
|
One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s.
|
||||||
|
The `preprocess.py` script located in the `scripts` folder a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
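The two behaviours described above (filtering and splitting) can be sketched as follows. This is not the contents of `scripts/preprocess.py`: the function `split_and_filter` and the toy `count_subtokens` callable are hypothetical stand-ins, and the real script would use the model's tokenizer to count subtokens.

```python
import math

# Control characters listed in the text above.
CONTROL_TOKENS = {"\x96", "\u200e", "\x95", "\xad", "\x80"}


def toy_count_subtokens(token):
    # Hypothetical tokenizer stand-in: control tokens yield nothing,
    # every other token counts as one subtoken per four characters.
    if token in CONTROL_TOKENS:
        return 0
    return math.ceil(len(token) / 4)


def split_and_filter(lines, count_subtokens, max_len):
    out = []
    running = 0  # subtokens accumulated in the current sentence
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            out.append("")  # existing sentence boundary
            running = 0
            continue
        n = count_subtokens(line.split(" ")[0])
        if n == 0:
            continue  # a) drop tokens that produce no subtokens
        if running + n > max_len:
            out.append("")  # b) start a new sentence once the limit is hit
            running = n
        else:
            running += n
        out.append(line)
    return out


sample = ["Das O", "\x96 O", "ist O", "Berlin B-LOC", "", "Ende O"]
cleaned = split_and_filter(sample, toy_count_subtokens, max_len=3)
print(cleaned)
```

With a real tokenizer, the limit would typically also need to leave room for special tokens; that detail is left out of this sketch.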
|
||||||
|
|
||||||
|
Let's define some variables that we need for further pre-processing steps and training the model:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export MAX_LENGTH=128
|
||||||
|
export BERT_MODEL=bert-base-multilingual-cased
|
||||||
|
```
|
||||||
|
|
||||||
|
Run the pre-processing script on training, dev and test datasets:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
|
||||||
|
python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
|
||||||
|
python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so a dedicated set of labels must be used:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
|
||||||
|
```
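The same label extraction can be sketched in Python. The helper `collect_labels` is an illustrative assumption and works on any iterable of lines in the `token label` format produced above. Note that Python sorts by code point, whereas `sort` follows the shell locale, so the order may differ for unusual label names.

```python
def collect_labels(lines):
    # Illustrative equivalent of: cut -d " " -f 2 | grep -v "^$" | sort | uniq
    labels = set()
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split(" ")
        # cut passes lines without a delimiter through unchanged
        label = fields[1] if len(fields) > 1 else line
        if label:  # grep -v "^$": skip blank sentence separators
            labels.add(label)
    return sorted(labels)


# Made-up sample lines, not taken from the dataset.
sample = ["Schartau B-PER\n", "sagte O\n", "\n", "Berlin B-LOC\n", "ist O\n"]
labels = collect_labels(sample)
print("\n".join(labels))
```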
|
||||||
|
|
||||||
|
#### Prepare the run
|
||||||
|
|
||||||
|
Additional environment variables must be set:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export OUTPUT_DIR=germeval-model
|
||||||
|
export BATCH_SIZE=32
|
||||||
|
export NUM_EPOCHS=3
|
||||||
|
export SAVE_STEPS=750
|
||||||
|
export SEED=1
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Run the PyTorch version
|
||||||
|
|
||||||
|
To start training, just run:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python3 run_ner.py --data_dir ./ \
|
||||||
|
--labels ./labels.txt \
|
||||||
|
--model_name_or_path $BERT_MODEL \
|
||||||
|
--output_dir $OUTPUT_DIR \
|
||||||
|
--max_seq_length $MAX_LENGTH \
|
||||||
|
--num_train_epochs $NUM_EPOCHS \
|
||||||
|
--per_device_train_batch_size $BATCH_SIZE \
|
||||||
|
--save_steps $SAVE_STEPS \
|
||||||
|
--seed $SEED \
|
||||||
|
--do_train \
|
||||||
|
--do_eval \
|
||||||
|
--do_predict
|
||||||
|
```
|
||||||
|
|
||||||
|
If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be evaluated on both the development and test datasets.
|
||||||
|
|
||||||
|
#### JSON-based configuration file
|
||||||
|
|
||||||
|
Instead of passing all parameters via command-line arguments, the `run_ner.py` script also supports reading parameters from a JSON-based configuration file:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"data_dir": ".",
|
||||||
|
"labels": "./labels.txt",
|
||||||
|
"model_name_or_path": "bert-base-multilingual-cased",
|
||||||
|
"output_dir": "germeval-model",
|
||||||
|
"max_seq_length": 128,
|
||||||
|
"num_train_epochs": 3,
|
||||||
|
"per_device_train_batch_size": 32,
|
||||||
|
"save_steps": 750,
|
||||||
|
"seed": 1,
|
||||||
|
"do_train": true,
|
||||||
|
"do_eval": true,
|
||||||
|
"do_predict": true
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
It must be saved with a `.json` extension and can be used by running `python3 run_ner.py config.json`.
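One way a script can support both invocation styles is to treat a single argument ending in `.json` as a configuration file and anything else as ordinary flags. The sketch below uses only the standard library and is an assumption about the general pattern, not the actual parsing code of `run_ner.py`, which is not shown in this diff. The function `parse_settings` is hypothetical, and only three of the parameters from the configuration above are declared.

```python
import argparse
import json
import os
import tempfile


def parse_settings(argv):
    # Hypothetical helper: declares a small subset of the parameters shown above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", default=".")
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--do_train", action="store_true")

    if len(argv) == 1 and argv[0].endswith(".json"):
        with open(argv[0], encoding="utf-8") as f:
            values = json.load(f)
        defaults = vars(parser.parse_args([]))
        return {**defaults, **values}  # JSON values override the defaults
    return vars(parser.parse_args(argv))


# Write a throwaway config file and compare both invocation styles.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump({"data_dir": ".", "max_seq_length": 128, "do_train": True}, f)
    from_json = parse_settings([config_path])

from_cli = parse_settings(["--data_dir", ".", "--max_seq_length", "128", "--do_train"])
print(from_json == from_cli)
```

Keeping one set of declared parameters for both paths means the JSON file and the flags cannot drift apart in names or defaults.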
|
||||||
|
|
||||||
|
#### Evaluation
|
||||||
|
|
||||||
|
Evaluation on the development dataset outputs the following for our example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
10/04/2019 00:42:06 - INFO - __main__ - ***** Eval results *****
|
||||||
|
10/04/2019 00:42:06 - INFO - __main__ - f1 = 0.8623348017621146
|
||||||
|
10/04/2019 00:42:06 - INFO - __main__ - loss = 0.07183869666975543
|
||||||
|
10/04/2019 00:42:06 - INFO - __main__ - precision = 0.8467916366258111
|
||||||
|
10/04/2019 00:42:06 - INFO - __main__ - recall = 0.8784592370979806
|
||||||
|
```
|
||||||
|
|
||||||
|
On the test dataset the following results could be achieved:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
10/04/2019 00:42:42 - INFO - __main__ - ***** Eval results *****
|
||||||
|
10/04/2019 00:42:42 - INFO - __main__ - f1 = 0.8614389652384803
|
||||||
|
10/04/2019 00:42:42 - INFO - __main__ - loss = 0.07064602487454782
|
||||||
|
10/04/2019 00:42:42 - INFO - __main__ - precision = 0.8604651162790697
|
||||||
|
10/04/2019 00:42:42 - INFO - __main__ - recall = 0.8624150210424085
|
||||||
|
```
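As a sanity check on the logs above, the reported F1 values are consistent with the harmonic mean of the reported precision and recall. The snippet assumes that this is how the metric was computed; the helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the development and test logs above.
dev_f1 = f1_from(0.8467916366258111, 0.8784592370979806)
test_f1 = f1_from(0.8604651162790697, 0.8624150210424085)

print(f"dev  f1 = {dev_f1:.6f}")
print(f"test f1 = {test_f1:.6f}")
```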
|
||||||
|
|
||||||
|
### Emerging and Rare Entities task: WNUT’17 (English NER) dataset
|
||||||
|
|
||||||
|
Description of the WNUT’17 task from the [shared task website](http://noisy-text.github.io/2017/index.html):
|
||||||
|
|
||||||
|
> The WNUT’17 shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions.
|
||||||
|
> Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on
|
||||||
|
> them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms.
|
||||||
|
|
||||||
|
Six labels are available in the dataset. An overview can be found on this [page](http://noisy-text.github.io/2017/files/).
|
||||||
|
|
||||||
|
#### Data (Download and pre-processing steps)
|
||||||
|
|
||||||
|
The dataset can be downloaded from the [official GitHub](https://github.com/leondz/emerging_entities_17) repository.
|
||||||
|
|
||||||
|
The following commands show how to prepare the dataset for fine-tuning:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mkdir -p data_wnut_17
|
||||||
|
|
||||||
|
curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/wnut17train.conll' | tr '\t' ' ' > data_wnut_17/train.txt.tmp
|
||||||
|
curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/emerging.dev.conll' | tr '\t' ' ' > data_wnut_17/dev.txt.tmp
|
||||||
|
curl -L 'https://raw.githubusercontent.com/leondz/emerging_entities_17/master/emerging.test.annotated' | tr '\t' ' ' > data_wnut_17/test.txt.tmp
|
||||||
|
```
|
||||||
|
|
||||||
|
Let's define some variables that we need for further pre-processing steps:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export MAX_LENGTH=128
|
||||||
|
export BERT_MODEL=bert-large-cased
|
||||||
|
```
|
||||||
|
|
||||||
|
Here we use the English BERT large model for fine-tuning.
|
||||||
|
The `preprocess.py` script splits longer sentences into smaller ones (once the max. subtoken length is reached):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python3 scripts/preprocess.py data_wnut_17/train.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/train.txt
|
||||||
|
python3 scripts/preprocess.py data_wnut_17/dev.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/dev.txt
|
||||||
|
python3 scripts/preprocess.py data_wnut_17/test.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/test.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
In the last pre-processing step, the `labels.txt` file needs to be generated. This file contains all available labels:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cat data_wnut_17/train.txt data_wnut_17/dev.txt data_wnut_17/test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > data_wnut_17/labels.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Run the PyTorch version
|
||||||
|
|
||||||
|
Fine-tuning with the PyTorch version can be started using the `run_ner.py` script. In this example we use a JSON-based configuration file.
|
||||||
|
|
||||||
|
This configuration file looks like:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"data_dir": "./data_wnut_17",
|
||||||
|
"labels": "./data_wnut_17/labels.txt",
|
||||||
|
"model_name_or_path": "bert-large-cased",
|
||||||
|
"output_dir": "wnut-17-model-1",
|
||||||
|
"max_seq_length": 128,
|
||||||
|
"num_train_epochs": 3,
|
||||||
|
"per_device_train_batch_size": 32,
|
||||||
|
"save_steps": 425,
|
||||||
|
"seed": 1,
|
||||||
|
"do_train": true,
|
||||||
|
"do_eval": true,
|
||||||
|
"do_predict": true,
|
||||||
|
"fp16": false
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
If your GPU supports half-precision training, please set `fp16` to `true`.
|
||||||
|
|
||||||
|
Save this JSON-based configuration under `wnut_17.json`. The fine-tuning can be started with `python3 run_ner.py wnut_17.json`.
|
||||||
|
|
||||||
|
#### Evaluation
|
||||||
|
|
||||||
|
Evaluation on the development dataset outputs the following:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
05/29/2020 23:33:44 - INFO - __main__ - ***** Eval results *****
|
||||||
|
05/29/2020 23:33:44 - INFO - __main__ - eval_loss = 0.26505235286212275
|
||||||
|
05/29/2020 23:33:44 - INFO - __main__ - eval_precision = 0.7008264462809918
|
||||||
|
05/29/2020 23:33:44 - INFO - __main__ - eval_recall = 0.507177033492823
|
||||||
|
05/29/2020 23:33:44 - INFO - __main__ - eval_f1 = 0.5884802220680084
|
||||||
|
05/29/2020 23:33:44 - INFO - __main__ - epoch = 3.0
|
||||||
|
```
|
||||||
|
|
||||||
|
On the test dataset the following results could be achieved:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
05/29/2020 23:33:44 - INFO - transformers.trainer - ***** Running Prediction *****
|
||||||
|
05/29/2020 23:34:02 - INFO - __main__ - eval_loss = 0.30948806500973547
|
||||||
|
05/29/2020 23:34:02 - INFO - __main__ - eval_precision = 0.5840108401084011
|
||||||
|
05/29/2020 23:34:02 - INFO - __main__ - eval_recall = 0.3994439295644115
|
||||||
|
05/29/2020 23:34:02 - INFO - __main__ - eval_f1 = 0.47440836543753434
|
||||||
|
```
|
||||||
|
|
||||||
|
WNUT’17 is a very difficult task. Current state-of-the-art results on this dataset can be found [here](http://nlpprogress.com/english/named_entity_recognition.html).
|
||||||
@@ -20,7 +20,7 @@ export NUM_EPOCHS=3
|
|||||||
export SAVE_STEPS=750
|
export SAVE_STEPS=750
|
||||||
export SEED=1
|
export SEED=1
|
||||||
|
|
||||||
python3 run_ner_old.py \
|
python3 run_ner.py \
|
||||||
--task_type NER \
|
--task_type NER \
|
||||||
--data_dir . \
|
--data_dir . \
|
||||||
--labels ./labels.txt \
|
--labels ./labels.txt \
|
||||||
@@ -21,7 +21,7 @@ export NUM_EPOCHS=3
|
|||||||
export SAVE_STEPS=750
|
export SAVE_STEPS=750
|
||||||
export SEED=1
|
export SEED=1
|
||||||
|
|
||||||
python3 run_ner_old.py \
|
python3 run_ner.py \
|
||||||
--task_type Chunk \
|
--task_type Chunk \
|
||||||
--data_dir . \
|
--data_dir . \
|
||||||
--model_name_or_path $BERT_MODEL \
|
--model_name_or_path $BERT_MODEL \
|
||||||
@@ -21,7 +21,7 @@ export NUM_EPOCHS=3
|
|||||||
export SAVE_STEPS=750
|
export SAVE_STEPS=750
|
||||||
export SEED=1
|
export SEED=1
|
||||||
|
|
||||||
python3 run_ner_old.py \
|
python3 run_ner.py \
|
||||||
--task_type POS \
|
--task_type POS \
|
||||||
--data_dir . \
|
--data_dir . \
|
||||||
--model_name_or_path $BERT_MODEL \
|
--model_name_or_path $BERT_MODEL \
|
||||||
@@ -1,3 +1,19 @@
|
|||||||
|
<!---
|
||||||
|
Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
## Multiple Choice
|
## Multiple Choice
|
||||||
|
|
||||||
Based on the script [`run_multiple_choice.py`]().
|
Based on the script [`run_multiple_choice.py`]().
|
||||||
|
|||||||
2
examples/multiple-choice/requirements.txt
Normal file
@@ -0,0 +1,2 @@
|
|||||||
|
sentencepiece != 0.1.92
|
||||||
|
protobuf
|
||||||
@@ -1,8 +1,29 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

## SQuAD

Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
Based on the script [`run_qa.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_qa.py).

**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library) as it
uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
[this table](https://huggingface.co/transformers/index.html#bigtable); if it doesn't, you can still use the old version
of the script.

The old version of this script can be found [here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/question-answering/run_squad.py).
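
As a complement to the table, a tokenizer can also be checked programmatically. The sketch below is illustrative and not part of `run_qa.py`: it relies on the `is_fast` attribute that 🤗 Transformers tokenizers expose, while the helper names `has_fast_tokenizer` and `checkpoint_has_fast_tokenizer` are assumptions made up for this example.

```python
def has_fast_tokenizer(tokenizer) -> bool:
    """Return True if the given tokenizer object reports itself as a fast tokenizer."""
    # Tokenizers in 🤗 Transformers expose an `is_fast` attribute;
    # objects without it are treated as "not fast".
    return bool(getattr(tokenizer, "is_fast", False))


def checkpoint_has_fast_tokenizer(checkpoint: str) -> bool:
    """Load the tokenizer of a checkpoint and report whether a fast version was loaded.

    Hypothetical helper: needs `transformers` installed and may download files.
    """
    # Imported lazily so the first helper stays usable without the library.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
    return has_fast_tokenizer(tokenizer)
```

Calling `checkpoint_has_fast_tokenizer` with a checkpoint name of your choice downloads the tokenizer files if needed and reports whether a fast version could be loaded. If it returns `False`, the legacy script linked above is the fallback.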

#### Fine-tuning BERT on SQuAD1.0

1
examples/question-answering/requirements.txt
Normal file
@@ -0,0 +1 @@
datasets >= 1.1.3

28
examples/research_projects/README.md
Normal file
@@ -0,0 +1,28 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Research projects

This folder contains various research projects using 🤗 Transformers. They are not maintained and require a specific
version of 🤗 Transformers that is indicated in the requirements file of each folder. Updating them to the most recent version of the library will require some work.

To use any of them, just run the command
```
pip install -r requirements.txt
```
inside the folder of your choice.

If you need help with any of those, contact the author(s), indicated at the top of the `README` of each folder.
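
Because each project pins its own version of 🤗 Transformers, it can help to compare the pin with what is installed before running anything. The following is a minimal sketch, not part of the repository: the helper names `pinned_version` and `installed_version` are hypothetical, only exact `==` pins are recognized, and `importlib.metadata` requires Python 3.8 or later.

```python
import re
from typing import Optional


def pinned_version(requirements_text: str, package: str) -> Optional[str]:
    """Return the version pinned with `==` for `package`, or None if there is no exact pin."""
    pattern = re.compile(
        r"^\s*" + re.escape(package) + r"\s*==\s*([^\s#]+)", re.IGNORECASE
    )
    for line in requirements_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    from importlib import metadata  # standard library, Python 3.8+

    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Inside a project folder, comparing `pinned_version(open("requirements.txt").read(), "transformers")` with `installed_version("transformers")` shows whether the environment matches the pin. If they differ, `pip install -r requirements.txt` in a fresh virtual environment is the simplest way to align them.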

1
examples/research_projects/adversarial/requirements.txt
Normal file
@@ -0,0 +1 @@
transformers == 3.5.1

@@ -0,0 +1 @@
transformers == 3.5.1

@@ -1,4 +1,4 @@
transformers
transformers == 3.5.1

# For ROUGE
nltk

1
examples/research_projects/bertology/requirements.txt
Normal file
@@ -0,0 +1 @@
transformers == 3.5.1

1
examples/research_projects/deebert/requirements.txt
Normal file
@@ -0,0 +1 @@
transformers == 3.5.1

0
examples/research_projects/deebert/src/__init__.py
Normal file

@@ -1,5 +1,7 @@
# Distil*

Author: @VictorSanh

This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT, DistilRoBERTa and DistilGPT2.

**January 20, 2020 - Bug fixing** We have recently discovered and fixed [a bug](https://github.com/huggingface/transformers/commit/48cbf267c988b56c71a2380f748a3e6092ccaed3) in the evaluation of our `run_*.py` scripts that caused the reported metrics to be over-estimated on average. We have updated all the metrics with the latest runs.

@@ -1,5 +1,7 @@
# Long Form Question Answering

Author: @yjernite

This folder contains the code for the Long Form Question Answering [demo](http://35.226.96.115:8080/) as well as methods to train and use a fully end-to-end Long Form Question Answering system using the [🤗transformers](https://github.com/huggingface/transformers) and [🤗datasets](https://github.com/huggingface/datasets) libraries.

You can use these methods to train your own system by following along with the associated [notebook](https://github.com/huggingface/notebooks/blob/master/longform-qa/Long_Form_Question_Answering_with_ELI5_and_Wikipedia.ipynb) or [blog post](https://yjernite.github.io/lfqa.html).

4
examples/research_projects/longform-qa/requirements.txt
Normal file
@@ -0,0 +1,4 @@
datasets >= 1.1.3
faiss-cpu
streamlit
elasticsearch

@@ -1,5 +1,7 @@
# Movement Pruning: Adaptive Sparsity by Fine-Tuning

Author: @VictorSanh

*Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of *movement pruning*, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters:*

| Fine-pruning+Distillation<br>(Teacher=BERT-base fine-tuned) | BERT base<br>fine-tuned | Remaining<br>Weights (%) | Magnitude Pruning | L0 Regularization | Movement Pruning | Soft Movement Pruning |

Some files were not shown because too many files have changed in this diff.