Reorganize examples (#9010)

* Reorganize example folder * Continue reorganization * Change requirements for tests * Final cleanup * Finish regroup with tests all passing * Copyright * Requirements and readme * Make a full link for the documentation * Address review comments * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Add symlink * Reorg again * Apply suggestions from code review Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com> * Adapt title * Update to new strucutre * Remove test * Update READMEs Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
2020-12-11 10:07:02 -05:00
parent 86896de064
commit 783d7d2629
215 changed files with 4454 additions and 1193 deletions
--- a/examples/token-classification/README.md
+++ b/examples/token-classification/README.md
@@ -1,3 +1,19 @@
+<!---
+Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
 ## Token classification

 Fine-tuning the library models for token classification task such as Named Entity Recognition (NER) or Parts-of-speech
@@ -32,10 +48,16 @@ python run_ner.py \
  --do_eval
 ```

+**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library) as it
+uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
+[this table](https://huggingface.co/transformers/index.html#bigtable), if it doesn't you can still use the old version
+of the script.
+
 ## Old version of the script

-Based on the scripts [`run_ner_old.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py) for Pytorch and
-[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_tf_ner.py) for Tensorflow 2.
+You can find the old version of the PyTorch script [here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/token-classification/run_ner_old.py).
+
+### TensorFlow version

 The following examples are covered in this section:

@@ -98,79 +120,6 @@ export SAVE_STEPS=750
 export SEED=1
 ```

-#### Run the Pytorch version
-
-To start training, just run:
-
-```bash
-python3 run_ner_old.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
-```
-
-If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be both evaluated on development and test datasets.
-
-#### JSON-based configuration file
-
-Instead of passing all parameters via commandline arguments, the `run_ner_old.py` script also supports reading parameters from a json-based configuration file:
-
-```json
-{
-    "data_dir": ".",
-    "labels": "./labels.txt",
-    "model_name_or_path": "bert-base-multilingual-cased",
-    "output_dir": "germeval-model",
-    "max_seq_length": 128,
-    "num_train_epochs": 3,
-    "per_device_train_batch_size": 32,
-    "save_steps": 750,
-    "seed": 1,
-    "do_train": true,
-    "do_eval": true,
-    "do_predict": true
-}
-```
-
-It must be saved with a `.json` extension and can be used by running `python3 run_ner_old.py config.json`.
-
-#### Evaluation
-
-Evaluation on development dataset outputs the following for our example:
-
-```bash
-10/04/2019 00:42:06 - INFO - __main__ -   ***** Eval results  *****
-10/04/2019 00:42:06 - INFO - __main__ -     f1 = 0.8623348017621146
-10/04/2019 00:42:06 - INFO - __main__ -     loss = 0.07183869666975543
-10/04/2019 00:42:06 - INFO - __main__ -     precision = 0.8467916366258111
-10/04/2019 00:42:06 - INFO - __main__ -     recall = 0.8784592370979806
-```
-
-On the test dataset the following results could be achieved:
-
-```bash
-10/04/2019 00:42:42 - INFO - __main__ -   ***** Eval results  *****
-10/04/2019 00:42:42 - INFO - __main__ -     f1 = 0.8614389652384803
-10/04/2019 00:42:42 - INFO - __main__ -     loss = 0.07064602487454782
-10/04/2019 00:42:42 - INFO - __main__ -     precision = 0.8604651162790697
-10/04/2019 00:42:42 - INFO - __main__ -     recall = 0.8624150210424085
-```
-
-#### Run PyTorch version using PyTorch-Lightning
-
-Run `bash run_pl.sh` from the `ner` directory. This would also install `pytorch-lightning` and the `examples/requirements.txt`. It is a shell pipeline which would automatically download, pre-process the data and run the models in `germeval-model` directory. Logs are saved in `lightning_logs` directory.
-
-Pass `--gpus` flag to change the number of GPUs. Default uses 1. At the end, the expected results are: `TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}`
-
-
 #### Run the Tensorflow 2 version

 To start training, just run:
@@ -235,102 +184,3 @@ On the test dataset the following results could be achieved:
 micro avg     0.8722    0.8774    0.8748     13869
 macro avg     0.8712    0.8774    0.8740     13869
 ```
-
-### Emerging and Rare Entities task: WNUT’17 (English NER) dataset
-
-Description of the WNUT’17 task from the [shared task website](http://noisy-text.github.io/2017/index.html):
-
-> The WNUT’17 shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions.
-> Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on
-> them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms.
-
-Six labels are available in the dataset. An overview can be found on this [page](http://noisy-text.github.io/2017/files/).
-
-#### Data (Download and pre-processing steps)
-
-The dataset can be downloaded from the [official GitHub](https://github.com/leondz/emerging_entities_17) repository.
-
-The following commands show how to prepare the dataset for fine-tuning:
-
-```bash
-mkdir -p data_wnut_17
-
-curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/wnut17train.conll'  | tr '\t' ' ' > data_wnut_17/train.txt.tmp
-curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/emerging.dev.conll' | tr '\t' ' ' > data_wnut_17/dev.txt.tmp
-curl -L 'https://raw.githubusercontent.com/leondz/emerging_entities_17/master/emerging.test.annotated' | tr '\t' ' ' > data_wnut_17/test.txt.tmp
-```
-
-Let's define some variables that we need for further pre-processing steps:
-
-```bash
-export MAX_LENGTH=128
-export BERT_MODEL=bert-large-cased
-```
-
-Here we use the English BERT large model for fine-tuning.
-The `preprocess.py` scripts splits longer sentences into smaller ones (once the max. subtoken length is reached):
-
-```bash
-python3 scripts/preprocess.py data_wnut_17/train.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/train.txt
-python3 scripts/preprocess.py data_wnut_17/dev.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/dev.txt
-python3 scripts/preprocess.py data_wnut_17/test.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/test.txt
-```
-
-In the last pre-processing step, the `labels.txt` file needs to be generated. This file contains all available labels:
-
-```bash
-cat data_wnut_17/train.txt data_wnut_17/dev.txt data_wnut_17/test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > data_wnut_17/labels.txt
-```
-
-#### Run the Pytorch version
-
-Fine-tuning with the PyTorch version can be started using the `run_ner_old.py` script. In this example we use a JSON-based configuration file.
-
-This configuration file looks like:
-
-```json
-{
-    "data_dir": "./data_wnut_17",
-    "labels": "./data_wnut_17/labels.txt",
-    "model_name_or_path": "bert-large-cased",
-    "output_dir": "wnut-17-model-1",
-    "max_seq_length": 128,
-    "num_train_epochs": 3,
-    "per_device_train_batch_size": 32,
-    "save_steps": 425,
-    "seed": 1,
-    "do_train": true,
-    "do_eval": true,
-    "do_predict": true,
-    "fp16": false
-}
-```
-
-If your GPU supports half-precision training, please set `fp16` to `true`.
-
-Save this JSON-based configuration under `wnut_17.json`. The fine-tuning can be started with `python3 run_ner_old.py wnut_17.json`.
-
-#### Evaluation
-
-Evaluation on development dataset outputs the following:
-
-```bash
-05/29/2020 23:33:44 - INFO - __main__ -   ***** Eval results *****
-05/29/2020 23:33:44 - INFO - __main__ -     eval_loss = 0.26505235286212275
-05/29/2020 23:33:44 - INFO - __main__ -     eval_precision = 0.7008264462809918
-05/29/2020 23:33:44 - INFO - __main__ -     eval_recall = 0.507177033492823
-05/29/2020 23:33:44 - INFO - __main__ -     eval_f1 = 0.5884802220680084
-05/29/2020 23:33:44 - INFO - __main__ -     epoch = 3.0
-```
-
-On the test dataset the following results could be achieved:
-
-```bash
-05/29/2020 23:33:44 - INFO - transformers.trainer -   ***** Running Prediction *****
-05/29/2020 23:34:02 - INFO - __main__ -     eval_loss = 0.30948806500973547
-05/29/2020 23:34:02 - INFO - __main__ -     eval_precision = 0.5840108401084011
-05/29/2020 23:34:02 - INFO - __main__ -     eval_recall = 0.3994439295644115
-05/29/2020 23:34:02 - INFO - __main__ -     eval_f1 = 0.47440836543753434
-```
-
-WNUT’17 is a very difficult task. Current state-of-the-art results on this dataset can be found [here](http://nlpprogress.com/english/named_entity_recognition.html).
--- a/examples/token-classification/requirements.txt
+++ b/examples/token-classification/requirements.txt
@@ -0,0 +1,2 @@
+seqeval
+datasets >= 1.1.3
--- a/examples/token-classification/run.sh
+++ b/examples/token-classification/run.sh
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 python3 run_ner.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name conll2003 \
--- a/examples/token-classification/run_chunk.sh
+++ b/examples/token-classification/run_chunk.sh
@@ -1,37 +0,0 @@
-if ! [ -f ./dev.txt ]; then
-  echo "Downloading CONLL2003 dev dataset...."
-  curl -L -o ./dev.txt 'https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/valid.txt'
-fi
-
-if ! [ -f ./test.txt ]; then
-  echo "Downloading CONLL2003 test dataset...."
-  curl -L -o ./test.txt 'https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/test.txt'
-fi
-
-if ! [ -f ./train.txt ]; then
-  echo "Downloading CONLL2003 train dataset...."
-  curl -L -o ./train.txt 'https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/train.txt'
-fi
-
-export MAX_LENGTH=200
-export BERT_MODEL=bert-base-uncased
-export OUTPUT_DIR=chunker-model
-export BATCH_SIZE=32
-export NUM_EPOCHS=3
-export SAVE_STEPS=750
-export SEED=1
-
-python3 run_ner_old.py \
--task_type Chunk \
--data_dir . \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
-
--- a/examples/token-classification/run_ner_old.py
+++ b/examples/token-classification/run_ner_old.py
@@ -1,316 +0,0 @@
-# coding=utf-8
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-""" Fine-tuning the library models for named entity recognition on CoNLL-2003. """
-import logging
-import os
-import sys
-from dataclasses import dataclass, field
-from importlib import import_module
-from typing import Dict, List, Optional, Tuple
-
-import numpy as np
-from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score
-from torch import nn
-
-import transformers
-from transformers import (
-    AutoConfig,
-    AutoModelForTokenClassification,
-    AutoTokenizer,
-    EvalPrediction,
-    HfArgumentParser,
-    Trainer,
-    TrainingArguments,
-    set_seed,
-)
-from transformers.trainer_utils import is_main_process
-from utils_ner import Split, TokenClassificationDataset, TokenClassificationTask
-
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass
-class ModelArguments:
-    """
-    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
-    """
-
-    model_name_or_path: str = field(
-        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
-    )
-    config_name: Optional[str] = field(
-        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
-    )
-    task_type: Optional[str] = field(
-        default="NER", metadata={"help": "Task type to fine tune in training (e.g. NER, POS, etc)"}
-    )
-    tokenizer_name: Optional[str] = field(
-        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
-    )
-    use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."})
-    # If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
-    # or just modify its tokenizer_config.json.
-    cache_dir: Optional[str] = field(
-        default=None,
-        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
-    )
-
-
-@dataclass
-class DataTrainingArguments:
-    """
-    Arguments pertaining to what data we are going to input our model for training and eval.
-    """
-
-    data_dir: str = field(
-        metadata={"help": "The input data dir. Should contain the .txt files for a CoNLL-2003-formatted task."}
-    )
-    labels: Optional[str] = field(
-        default=None,
-        metadata={"help": "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."},
-    )
-    max_seq_length: int = field(
-        default=128,
-        metadata={
-            "help": "The maximum total input sequence length after tokenization. Sequences longer "
-            "than this will be truncated, sequences shorter will be padded."
-        },
-    )
-    overwrite_cache: bool = field(
-        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
-    )
-
-
-def main():
-    # See all possible arguments in src/transformers/training_args.py
-    # or by passing the --help flag to this script.
-    # We now keep distinct sets of args, for a cleaner separation of concerns.
-
-    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
-    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
-        # If we pass only one argument to the script and it's the path to a json file,
-        # let's parse it to get our arguments.
-        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
-    else:
-        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
-
-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
-        )
-
-    module = import_module("tasks")
-    try:
-        token_classification_task_clazz = getattr(module, model_args.task_type)
-        token_classification_task: TokenClassificationTask = token_classification_task_clazz()
-    except AttributeError:
-        raise ValueError(
-            f"Task {model_args.task_type} needs to be defined as a TokenClassificationTask subclass in {module}. "
-            f"Available tasks classes are: {TokenClassificationTask.__subclasses__()}"
-        )
-
-    # Setup logging
-    logging.basicConfig(
-        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
-        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
-    )
-    logger.warning(
-        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
-        training_args.local_rank,
-        training_args.device,
-        training_args.n_gpu,
-        bool(training_args.local_rank != -1),
-        training_args.fp16,
-    )
-    # Set the verbosity to info of the Transformers logger (on main process only):
-    if is_main_process(training_args.local_rank):
-        transformers.utils.logging.set_verbosity_info()
-        transformers.utils.logging.enable_default_handler()
-        transformers.utils.logging.enable_explicit_format()
-    logger.info("Training/evaluation parameters %s", training_args)
-
-    # Set seed
-    set_seed(training_args.seed)
-
-    # Prepare CONLL-2003 task
-    labels = token_classification_task.get_labels(data_args.labels)
-    label_map: Dict[int, str] = {i: label for i, label in enumerate(labels)}
-    num_labels = len(labels)
-
-    # Load pretrained model and tokenizer
-    #
-    # Distributed training:
-    # The .from_pretrained methods guarantee that only one local process can concurrently
-    # download model & vocab.
-
-    config = AutoConfig.from_pretrained(
-        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
-        num_labels=num_labels,
-        id2label=label_map,
-        label2id={label: i for i, label in enumerate(labels)},
-        cache_dir=model_args.cache_dir,
-    )
-    tokenizer = AutoTokenizer.from_pretrained(
-        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
-        cache_dir=model_args.cache_dir,
-        use_fast=model_args.use_fast,
-    )
-    model = AutoModelForTokenClassification.from_pretrained(
-        model_args.model_name_or_path,
-        from_tf=bool(".ckpt" in model_args.model_name_or_path),
-        config=config,
-        cache_dir=model_args.cache_dir,
-    )
-
-    # Get datasets
-    train_dataset = (
-        TokenClassificationDataset(
-            token_classification_task=token_classification_task,
-            data_dir=data_args.data_dir,
-            tokenizer=tokenizer,
-            labels=labels,
-            model_type=config.model_type,
-            max_seq_length=data_args.max_seq_length,
-            overwrite_cache=data_args.overwrite_cache,
-            mode=Split.train,
-        )
-        if training_args.do_train
-        else None
-    )
-    eval_dataset = (
-        TokenClassificationDataset(
-            token_classification_task=token_classification_task,
-            data_dir=data_args.data_dir,
-            tokenizer=tokenizer,
-            labels=labels,
-            model_type=config.model_type,
-            max_seq_length=data_args.max_seq_length,
-            overwrite_cache=data_args.overwrite_cache,
-            mode=Split.dev,
-        )
-        if training_args.do_eval
-        else None
-    )
-
-    def align_predictions(predictions: np.ndarray, label_ids: np.ndarray) -> Tuple[List[int], List[int]]:
-        preds = np.argmax(predictions, axis=2)
-
-        batch_size, seq_len = preds.shape
-
-        out_label_list = [[] for _ in range(batch_size)]
-        preds_list = [[] for _ in range(batch_size)]
-
-        for i in range(batch_size):
-            for j in range(seq_len):
-                if label_ids[i, j] != nn.CrossEntropyLoss().ignore_index:
-                    out_label_list[i].append(label_map[label_ids[i][j]])
-                    preds_list[i].append(label_map[preds[i][j]])
-
-        return preds_list, out_label_list
-
-    def compute_metrics(p: EvalPrediction) -> Dict:
-        preds_list, out_label_list = align_predictions(p.predictions, p.label_ids)
-        return {
-            "accuracy_score": accuracy_score(out_label_list, preds_list),
-            "precision": precision_score(out_label_list, preds_list),
-            "recall": recall_score(out_label_list, preds_list),
-            "f1": f1_score(out_label_list, preds_list),
-        }
-
-    # Initialize our Trainer
-    trainer = Trainer(
-        model=model,
-        args=training_args,
-        train_dataset=train_dataset,
-        eval_dataset=eval_dataset,
-        compute_metrics=compute_metrics,
-    )
-
-    # Training
-    if training_args.do_train:
-        trainer.train(
-            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
-        )
-        trainer.save_model()
-        # For convenience, we also re-save the tokenizer to the same directory,
-        # so that you can share your model easily on huggingface.co/models =)
-        if trainer.is_world_process_zero():
-            tokenizer.save_pretrained(training_args.output_dir)
-
-    # Evaluation
-    results = {}
-    if training_args.do_eval:
-        logger.info("*** Evaluate ***")
-
-        result = trainer.evaluate()
-
-        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
-        if trainer.is_world_process_zero():
-            with open(output_eval_file, "w") as writer:
-                logger.info("***** Eval results *****")
-                for key, value in result.items():
-                    logger.info("  %s = %s", key, value)
-                    writer.write("%s = %s\n" % (key, value))
-
-            results.update(result)
-
-    # Predict
-    if training_args.do_predict:
-        test_dataset = TokenClassificationDataset(
-            token_classification_task=token_classification_task,
-            data_dir=data_args.data_dir,
-            tokenizer=tokenizer,
-            labels=labels,
-            model_type=config.model_type,
-            max_seq_length=data_args.max_seq_length,
-            overwrite_cache=data_args.overwrite_cache,
-            mode=Split.test,
-        )
-
-        predictions, label_ids, metrics = trainer.predict(test_dataset)
-        preds_list, _ = align_predictions(predictions, label_ids)
-
-        output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt")
-        if trainer.is_world_process_zero():
-            with open(output_test_results_file, "w") as writer:
-                for key, value in metrics.items():
-                    logger.info("  %s = %s", key, value)
-                    writer.write("%s = %s\n" % (key, value))
-
-        # Save predictions
-        output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")
-        if trainer.is_world_process_zero():
-            with open(output_test_predictions_file, "w") as writer:
-                with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
-                    token_classification_task.write_predictions_to_file(writer, f, preds_list)
-
-    return results
-
-
-def _mp_fn(index):
-    # For xla_spawn (TPUs)
-    main()
-
-
-if __name__ == "__main__":
-    main()
--- a/examples/token-classification/run_old.sh
+++ b/examples/token-classification/run_old.sh
@@ -1,36 +0,0 @@
-## The relevant files are currently on a shared Google
-## drive at https://drive.google.com/drive/folders/1kC0I2UGl2ltrluI9NqDjaQJGw5iliw_J
-## Monitor for changes and eventually migrate to nlp dataset
-curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
-curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
-curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
-
-export MAX_LENGTH=128
-export BERT_MODEL=bert-base-multilingual-cased
-python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
-python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
-python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
-cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
-export OUTPUT_DIR=germeval-model
-export BATCH_SIZE=32
-export NUM_EPOCHS=3
-export SAVE_STEPS=750
-export SEED=1
-
-python3 run_ner_old.py \
--task_type NER \
--data_dir . \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
--- a/examples/token-classification/run_pl.sh
+++ b/examples/token-classification/run_pl.sh
@@ -1,44 +0,0 @@
-#!/usr/bin/env bash
-
-# for seqeval metrics import
-pip install -r ../requirements.txt
-
-## The relevant files are currently on a shared Google
-## drive at https://drive.google.com/drive/folders/1kC0I2UGl2ltrluI9NqDjaQJGw5iliw_J
-## Monitor for changes and eventually migrate to nlp dataset
-curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
-curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
-curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
-
-export MAX_LENGTH=128
-export BERT_MODEL=bert-base-multilingual-cased
-python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
-python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
-python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
-cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
-export BATCH_SIZE=32
-export NUM_EPOCHS=3
-export SEED=1
-
-export OUTPUT_DIR_NAME=germeval-model
-export CURRENT_DIR=${PWD}
-export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
-mkdir -p $OUTPUT_DIR
-
-# Add parent directory to python path to access lightning_base.py
-export PYTHONPATH="../":"${PYTHONPATH}"
-
-python3 run_pl_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--train_batch_size $BATCH_SIZE \
--seed $SEED \
--gpus 1 \
--do_train \
--do_predict
--- a/examples/token-classification/run_pl_ner.py
+++ b/examples/token-classification/run_pl_ner.py
@@ -1,215 +0,0 @@
-import argparse
-import glob
-import logging
-import os
-from argparse import Namespace
-from importlib import import_module
-
-import numpy as np
-import torch
-from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score
-from torch.nn import CrossEntropyLoss
-from torch.utils.data import DataLoader, TensorDataset
-
-from lightning_base import BaseTransformer, add_generic_args, generic_train
-from utils_ner import TokenClassificationTask
-
-
-logger = logging.getLogger(__name__)
-
-
-class NERTransformer(BaseTransformer):
-    """
-    A training module for NER. See BaseTransformer for the core options.
-    """
-
-    mode = "token-classification"
-
-    def __init__(self, hparams):
-        if type(hparams) == dict:
-            hparams = Namespace(**hparams)
-        module = import_module("tasks")
-        try:
-            token_classification_task_clazz = getattr(module, hparams.task_type)
-            self.token_classification_task: TokenClassificationTask = token_classification_task_clazz()
-        except AttributeError:
-            raise ValueError(
-                f"Task {hparams.task_type} needs to be defined as a TokenClassificationTask subclass in {module}. "
-                f"Available tasks classes are: {TokenClassificationTask.__subclasses__()}"
-            )
-        self.labels = self.token_classification_task.get_labels(hparams.labels)
-        self.pad_token_label_id = CrossEntropyLoss().ignore_index
-        super().__init__(hparams, len(self.labels), self.mode)
-
-    def forward(self, **inputs):
-        return self.model(**inputs)
-
-    def training_step(self, batch, batch_num):
-        "Compute loss and log."
-        inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
-        if self.config.model_type != "distilbert":
-            inputs["token_type_ids"] = (
-                batch[2] if self.config.model_type in ["bert", "xlnet"] else None
-            )  # XLM and RoBERTa don"t use token_type_ids
-
-        outputs = self(**inputs)
-        loss = outputs[0]
-        # tensorboard_logs = {"loss": loss, "rate": self.lr_scheduler.get_last_lr()[-1]}
-        return {"loss": loss}
-
-    def prepare_data(self):
-        "Called to initialize data. Use the call to construct features"
-        args = self.hparams
-        for mode in ["train", "dev", "test"]:
-            cached_features_file = self._feature_file(mode)
-            if os.path.exists(cached_features_file) and not args.overwrite_cache:
-                logger.info("Loading features from cached file %s", cached_features_file)
-                features = torch.load(cached_features_file)
-            else:
-                logger.info("Creating features from dataset file at %s", args.data_dir)
-                examples = self.token_classification_task.read_examples_from_file(args.data_dir, mode)
-                features = self.token_classification_task.convert_examples_to_features(
-                    examples,
-                    self.labels,
-                    args.max_seq_length,
-                    self.tokenizer,
-                    cls_token_at_end=bool(self.config.model_type in ["xlnet"]),
-                    cls_token=self.tokenizer.cls_token,
-                    cls_token_segment_id=2 if self.config.model_type in ["xlnet"] else 0,
-                    sep_token=self.tokenizer.sep_token,
-                    sep_token_extra=False,
-                    pad_on_left=bool(self.config.model_type in ["xlnet"]),
-                    pad_token=self.tokenizer.pad_token_id,
-                    pad_token_segment_id=self.tokenizer.pad_token_type_id,
-                    pad_token_label_id=self.pad_token_label_id,
-                )
-                logger.info("Saving features into cached file %s", cached_features_file)
-                torch.save(features, cached_features_file)
-
-    def get_dataloader(self, mode: int, batch_size: int, shuffle: bool = False) -> DataLoader:
-        "Load datasets. Called after prepare data."
-        cached_features_file = self._feature_file(mode)
-        logger.info("Loading features from cached file %s", cached_features_file)
-        features = torch.load(cached_features_file)
-        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
-        all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
-        if features[0].token_type_ids is not None:
-            all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
-        else:
-            all_token_type_ids = torch.tensor([0 for f in features], dtype=torch.long)
-            # HACK(we will not use this anymore soon)
-        all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)
-        return DataLoader(
-            TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_label_ids), batch_size=batch_size
-        )
-
-    def validation_step(self, batch, batch_nb):
-        """Compute validation""" ""
-        inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
-        if self.config.model_type != "distilbert":
-            inputs["token_type_ids"] = (
-                batch[2] if self.config.model_type in ["bert", "xlnet"] else None
-            )  # XLM and RoBERTa don"t use token_type_ids
-        outputs = self(**inputs)
-        tmp_eval_loss, logits = outputs[:2]
-        preds = logits.detach().cpu().numpy()
-        out_label_ids = inputs["labels"].detach().cpu().numpy()
-        return {"val_loss": tmp_eval_loss.detach().cpu(), "pred": preds, "target": out_label_ids}
-
-    def _eval_end(self, outputs):
-        "Evaluation called for both Val and Test"
-        val_loss_mean = torch.stack([x["val_loss"] for x in outputs]).mean()
-        preds = np.concatenate([x["pred"] for x in outputs], axis=0)
-        preds = np.argmax(preds, axis=2)
-        out_label_ids = np.concatenate([x["target"] for x in outputs], axis=0)
-
-        label_map = {i: label for i, label in enumerate(self.labels)}
-        out_label_list = [[] for _ in range(out_label_ids.shape[0])]
-        preds_list = [[] for _ in range(out_label_ids.shape[0])]
-
-        for i in range(out_label_ids.shape[0]):
-            for j in range(out_label_ids.shape[1]):
-                if out_label_ids[i, j] != self.pad_token_label_id:
-                    out_label_list[i].append(label_map[out_label_ids[i][j]])
-                    preds_list[i].append(label_map[preds[i][j]])
-
-        results = {
-            "val_loss": val_loss_mean,
-            "accuracy_score": accuracy_score(out_label_list, preds_list),
-            "precision": precision_score(out_label_list, preds_list),
-            "recall": recall_score(out_label_list, preds_list),
-            "f1": f1_score(out_label_list, preds_list),
-        }
-
-        ret = {k: v for k, v in results.items()}
-        ret["log"] = results
-        return ret, preds_list, out_label_list
-
-    def validation_epoch_end(self, outputs):
-        # when stable
-        ret, preds, targets = self._eval_end(outputs)
-        logs = ret["log"]
-        return {"val_loss": logs["val_loss"], "log": logs, "progress_bar": logs}
-
-    def test_epoch_end(self, outputs):
-        # updating to test_epoch_end instead of deprecated test_end
-        ret, predictions, targets = self._eval_end(outputs)
-
-        # Converting to the dict required by pl
-        # https://github.com/PyTorchLightning/pytorch-lightning/blob/master/\
-        # pytorch_lightning/trainer/logging.py#L139
-        logs = ret["log"]
-        # `val_loss` is the key returned by `self._eval_end()` but actually refers to `test_loss`
-        return {"avg_test_loss": logs["val_loss"], "log": logs, "progress_bar": logs}
-
-    @staticmethod
-    def add_model_specific_args(parser, root_dir):
-        # Add NER specific options
-        BaseTransformer.add_model_specific_args(parser, root_dir)
-        parser.add_argument(
-            "--task_type", default="NER", type=str, help="Task type to fine tune in training (e.g. NER, POS, etc)"
-        )
-        parser.add_argument(
-            "--max_seq_length",
-            default=128,
-            type=int,
-            help="The maximum total input sequence length after tokenization. Sequences longer "
-            "than this will be truncated, sequences shorter will be padded.",
-        )
-
-        parser.add_argument(
-            "--labels",
-            default="",
-            type=str,
-            help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.",
-        )
-        parser.add_argument(
-            "--gpus",
-            default=0,
-            type=int,
-            help="The number of GPUs allocated for this, it is by default 0 meaning none",
-        )
-
-        parser.add_argument(
-            "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
-        )
-
-        return parser
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    add_generic_args(parser, os.getcwd())
-    parser = NERTransformer.add_model_specific_args(parser, os.getcwd())
-    args = parser.parse_args()
-    model = NERTransformer(args)
-    trainer = generic_train(model, args)
-
-    if args.do_predict:
-        # See https://github.com/huggingface/transformers/issues/3159
-        # pl use this default format to create a checkpoint:
-        # https://github.com/PyTorchLightning/pytorch-lightning/blob/master\
-        # /pytorch_lightning/callbacks/model_checkpoint.py#L322
-        checkpoints = list(sorted(glob.glob(os.path.join(args.output_dir, "checkpoint-epoch=*.ckpt"), recursive=True)))
-        model = model.load_from_checkpoint(checkpoints[-1])
-        trainer.test(model)
--- a/examples/token-classification/run_pos.sh
+++ b/examples/token-classification/run_pos.sh
@@ -1,37 +0,0 @@
-if ! [ -f ./dev.txt ]; then
-  echo "Download dev dataset...."
-  curl -L -o ./dev.txt 'https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-dev.conllu'
-fi
-
-if ! [ -f ./test.txt ]; then
-  echo "Download test dataset...."
-  curl -L -o ./test.txt 'https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-test.conllu'
-fi
-
-if ! [ -f ./train.txt ]; then
-  echo "Download train dataset...."
-  curl -L -o ./train.txt 'https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-train.conllu'
-fi
-
-export MAX_LENGTH=200
-export BERT_MODEL=bert-base-uncased
-export OUTPUT_DIR=postagger-model
-export BATCH_SIZE=32
-export NUM_EPOCHS=3
-export SAVE_STEPS=750
-export SEED=1
-
-python3 run_ner_old.py \
--task_type POS \
--data_dir . \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
-
--- a/examples/token-classification/run_pos_pl.sh
+++ b/examples/token-classification/run_pos_pl.sh
@@ -1,39 +0,0 @@
-#!/usr/bin/env bash
-if ! [ -f ./dev.txt ]; then
-  echo "Download dev dataset...."
-  curl -L -o ./dev.txt 'https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-dev.conllu'
-fi
-
-if ! [ -f ./test.txt ]; then
-  echo "Download test dataset...."
-  curl -L -o ./test.txt 'https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-test.conllu'
-fi
-
-if ! [ -f ./train.txt ]; then
-  echo "Download train dataset...."
-  curl -L -o ./train.txt 'https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-train.conllu'
-fi
-
-export MAX_LENGTH=200
-export BERT_MODEL=bert-base-uncased
-export OUTPUT_DIR=postagger-model
-export BATCH_SIZE=32
-export NUM_EPOCHS=3
-export SAVE_STEPS=750
-export SEED=1
-
-
-# Add parent directory to python path to access lightning_base.py
-export PYTHONPATH="../":"${PYTHONPATH}"
-
-python3 run_pl_ner.py --data_dir ./ \
--task_type POS \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--train_batch_size $BATCH_SIZE \
--seed $SEED \
--gpus 1 \
--do_train \
--do_predict
--- a/examples/token-classification/scripts/preprocess.py
+++ b/examples/token-classification/scripts/preprocess.py
@@ -1,41 +0,0 @@
-import sys
-
-from transformers import AutoTokenizer
-
-
-dataset = sys.argv[1]
-model_name_or_path = sys.argv[2]
-max_len = int(sys.argv[3])
-
-subword_len_counter = 0
-
-tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
-max_len -= tokenizer.num_special_tokens_to_add()
-
-with open(dataset, "rt") as f_p:
-    for line in f_p:
-        line = line.rstrip()
-
-        if not line:
-            print(line)
-            subword_len_counter = 0
-            continue
-
-        token = line.split()[0]
-
-        current_subwords_len = len(tokenizer.tokenize(token))
-
-        # Token contains strange control characters like \x96 or \x95
-        # Just filter out the complete line
-        if current_subwords_len == 0:
-            continue
-
-        if (subword_len_counter + current_subwords_len) > max_len:
-            print("")
-            print(line)
-            subword_len_counter = current_subwords_len
-            continue
-
-        subword_len_counter += current_subwords_len
-
-        print(line)
--- a/examples/token-classification/tasks.py
+++ b/examples/token-classification/tasks.py
@@ -1,163 +0,0 @@
-import logging
-import os
-from typing import List, TextIO, Union
-
-from conllu import parse_incr
-
-from utils_ner import InputExample, Split, TokenClassificationTask
-
-
-logger = logging.getLogger(__name__)
-
-
-class NER(TokenClassificationTask):
-    def __init__(self, label_idx=-1):
-        # in NER datasets, the last column is usually reserved for NER label
-        self.label_idx = label_idx
-
-    def read_examples_from_file(self, data_dir, mode: Union[Split, str]) -> List[InputExample]:
-        if isinstance(mode, Split):
-            mode = mode.value
-        file_path = os.path.join(data_dir, f"{mode}.txt")
-        guid_index = 1
-        examples = []
-        with open(file_path, encoding="utf-8") as f:
-            words = []
-            labels = []
-            for line in f:
-                if line.startswith("-DOCSTART-") or line == "" or line == "\n":
-                    if words:
-                        examples.append(InputExample(guid=f"{mode}-{guid_index}", words=words, labels=labels))
-                        guid_index += 1
-                        words = []
-                        labels = []
-                else:
-                    splits = line.split(" ")
-                    words.append(splits[0])
-                    if len(splits) > 1:
-                        labels.append(splits[self.label_idx].replace("\n", ""))
-                    else:
-                        # Examples could have no label for mode = "test"
-                        labels.append("O")
-            if words:
-                examples.append(InputExample(guid=f"{mode}-{guid_index}", words=words, labels=labels))
-        return examples
-
-    def write_predictions_to_file(self, writer: TextIO, test_input_reader: TextIO, preds_list: List):
-        example_id = 0
-        for line in test_input_reader:
-            if line.startswith("-DOCSTART-") or line == "" or line == "\n":
-                writer.write(line)
-                if not preds_list[example_id]:
-                    example_id += 1
-            elif preds_list[example_id]:
-                output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
-                writer.write(output_line)
-            else:
-                logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
-
-    def get_labels(self, path: str) -> List[str]:
-        if path:
-            with open(path, "r") as f:
-                labels = f.read().splitlines()
-            if "O" not in labels:
-                labels = ["O"] + labels
-            return labels
-        else:
-            return ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
-
-
-class Chunk(NER):
-    def __init__(self):
-        # in CONLL2003 dataset chunk column is second-to-last
-        super().__init__(label_idx=-2)
-
-    def get_labels(self, path: str) -> List[str]:
-        if path:
-            with open(path, "r") as f:
-                labels = f.read().splitlines()
-            if "O" not in labels:
-                labels = ["O"] + labels
-            return labels
-        else:
-            return [
-                "O",
-                "B-ADVP",
-                "B-INTJ",
-                "B-LST",
-                "B-PRT",
-                "B-NP",
-                "B-SBAR",
-                "B-VP",
-                "B-ADJP",
-                "B-CONJP",
-                "B-PP",
-                "I-ADVP",
-                "I-INTJ",
-                "I-LST",
-                "I-PRT",
-                "I-NP",
-                "I-SBAR",
-                "I-VP",
-                "I-ADJP",
-                "I-CONJP",
-                "I-PP",
-            ]
-
-
-class POS(TokenClassificationTask):
-    def read_examples_from_file(self, data_dir, mode: Union[Split, str]) -> List[InputExample]:
-        if isinstance(mode, Split):
-            mode = mode.value
-        file_path = os.path.join(data_dir, f"{mode}.txt")
-        guid_index = 1
-        examples = []
-
-        with open(file_path, encoding="utf-8") as f:
-            for sentence in parse_incr(f):
-                words = []
-                labels = []
-                for token in sentence:
-                    words.append(token["form"])
-                    labels.append(token["upos"])
-                assert len(words) == len(labels)
-                if words:
-                    examples.append(InputExample(guid=f"{mode}-{guid_index}", words=words, labels=labels))
-                    guid_index += 1
-        return examples
-
-    def write_predictions_to_file(self, writer: TextIO, test_input_reader: TextIO, preds_list: List):
-        example_id = 0
-        for sentence in parse_incr(test_input_reader):
-            s_p = preds_list[example_id]
-            out = ""
-            for token in sentence:
-                out += f'{token["form"]} ({token["upos"]}|{s_p.pop(0)}) '
-            out += "\n"
-            writer.write(out)
-            example_id += 1
-
-    def get_labels(self, path: str) -> List[str]:
-        if path:
-            with open(path, "r") as f:
-                return f.read().splitlines()
-        else:
-            return [
-                "ADJ",
-                "ADP",
-                "ADV",
-                "AUX",
-                "CCONJ",
-                "DET",
-                "INTJ",
-                "NOUN",
-                "NUM",
-                "PART",
-                "PRON",
-                "PROPN",
-                "PUNCT",
-                "SCONJ",
-                "SYM",
-                "VERB",
-                "X",
-            ]
--- a/examples/token-classification/test_ner_examples.py
+++ b/examples/token-classification/test_ner_examples.py
@@ -1,57 +0,0 @@
-import logging
-import sys
-import unittest
-from unittest.mock import patch
-
-import run_ner_old as run_ner
-from transformers.testing_utils import require_torch_non_multi_gpu_but_fix_me, slow
-
-
-logging.basicConfig(level=logging.INFO)
-
-logger = logging.getLogger()
-
-
-class ExamplesTests(unittest.TestCase):
-    @slow
-    @require_torch_non_multi_gpu_but_fix_me
-    def test_run_ner(self):
-        stream_handler = logging.StreamHandler(sys.stdout)
-        logger.addHandler(stream_handler)
-
-        testargs = """
-            --model_name distilbert-base-german-cased
-            --output_dir ./tests/fixtures/tests_samples/temp_dir
-            --overwrite_output_dir
-            --data_dir ./tests/fixtures/tests_samples/GermEval
-            --labels ./tests/fixtures/tests_samples/GermEval/labels.txt
-            --max_seq_length 128
-            --num_train_epochs 6
-            --logging_steps 1
-            --do_train
-            --do_eval
-            """.split()
-        with patch.object(sys, "argv", ["run.py"] + testargs):
-            result = run_ner.main()
-            self.assertLess(result["eval_loss"], 1.5)
-
-    @require_torch_non_multi_gpu_but_fix_me
-    def test_run_ner_pl(self):
-        stream_handler = logging.StreamHandler(sys.stdout)
-        logger.addHandler(stream_handler)
-
-        testargs = """
-            --model_name distilbert-base-german-cased
-            --output_dir ./tests/fixtures/tests_samples/temp_dir
-            --overwrite_output_dir
-            --data_dir ./tests/fixtures/tests_samples/GermEval
-            --labels ./tests/fixtures/tests_samples/GermEval/labels.txt
-            --max_seq_length 128
-            --num_train_epochs 6
-            --logging_steps 1
-            --do_train
-            --do_eval
-            """.split()
-        with patch.object(sys, "argv", ["run.py"] + testargs):
-            result = run_ner.main()
-            self.assertLess(result["eval_loss"], 1.5)
--- a/examples/token-classification/utils_ner.py
+++ b/examples/token-classification/utils_ner.py
@@ -1,372 +0,0 @@
-# coding=utf-8
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-""" Named entity recognition fine-tuning: utilities to work with CoNLL-2003 task. """
-
-
-import logging
-import os
-from dataclasses import dataclass
-from enum import Enum
-from typing import List, Optional, Union
-
-from filelock import FileLock
-from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available
-
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass
-class InputExample:
-    """
-    A single training/test example for token classification.
-
-    Args:
-        guid: Unique id for the example.
-        words: list. The words of the sequence.
-        labels: (Optional) list. The labels for each word of the sequence. This should be
-        specified for train and dev examples, but not for test examples.
-    """
-
-    guid: str
-    words: List[str]
-    labels: Optional[List[str]]
-
-
-@dataclass
-class InputFeatures:
-    """
-    A single set of features of data.
-    Property names are the same names as the corresponding inputs to a model.
-    """
-
-    input_ids: List[int]
-    attention_mask: List[int]
-    token_type_ids: Optional[List[int]] = None
-    label_ids: Optional[List[int]] = None
-
-
-class Split(Enum):
-    train = "train"
-    dev = "dev"
-    test = "test"
-
-
-class TokenClassificationTask:
-    @staticmethod
-    def read_examples_from_file(data_dir, mode: Union[Split, str]) -> List[InputExample]:
-        raise NotImplementedError
-
-    @staticmethod
-    def get_labels(path: str) -> List[str]:
-        raise NotImplementedError
-
-    @staticmethod
-    def convert_examples_to_features(
-        examples: List[InputExample],
-        label_list: List[str],
-        max_seq_length: int,
-        tokenizer: PreTrainedTokenizer,
-        cls_token_at_end=False,
-        cls_token="[CLS]",
-        cls_token_segment_id=1,
-        sep_token="[SEP]",
-        sep_token_extra=False,
-        pad_on_left=False,
-        pad_token=0,
-        pad_token_segment_id=0,
-        pad_token_label_id=-100,
-        sequence_a_segment_id=0,
-        mask_padding_with_zero=True,
-    ) -> List[InputFeatures]:
-        """Loads a data file into a list of `InputFeatures`
-        `cls_token_at_end` define the location of the CLS token:
-            - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
-            - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
-        `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
-        """
-        # TODO clean up all this to leverage built-in features of tokenizers
-
-        label_map = {label: i for i, label in enumerate(label_list)}
-
-        features = []
-        for (ex_index, example) in enumerate(examples):
-            if ex_index % 10_000 == 0:
-                logger.info("Writing example %d of %d", ex_index, len(examples))
-
-            tokens = []
-            label_ids = []
-            for word, label in zip(example.words, example.labels):
-                word_tokens = tokenizer.tokenize(word)
-
-                # bert-base-multilingual-cased sometimes output "nothing ([]) when calling tokenize with just a space.
-                if len(word_tokens) > 0:
-                    tokens.extend(word_tokens)
-                    # Use the real label id for the first token of the word, and padding ids for the remaining tokens
-                    label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
-
-            # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
-            special_tokens_count = tokenizer.num_special_tokens_to_add()
-            if len(tokens) > max_seq_length - special_tokens_count:
-                tokens = tokens[: (max_seq_length - special_tokens_count)]
-                label_ids = label_ids[: (max_seq_length - special_tokens_count)]
-
-            # The convention in BERT is:
-            # (a) For sequence pairs:
-            #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
-            #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
-            # (b) For single sequences:
-            #  tokens:   [CLS] the dog is hairy . [SEP]
-            #  type_ids:   0   0   0   0  0     0   0
-            #
-            # Where "type_ids" are used to indicate whether this is the first
-            # sequence or the second sequence. The embedding vectors for `type=0` and
-            # `type=1` were learned during pre-training and are added to the wordpiece
-            # embedding vector (and position vector). This is not *strictly* necessary
-            # since the [SEP] token unambiguously separates the sequences, but it makes
-            # it easier for the model to learn the concept of sequences.
-            #
-            # For classification tasks, the first vector (corresponding to [CLS]) is
-            # used as as the "sentence vector". Note that this only makes sense because
-            # the entire model is fine-tuned.
-            tokens += [sep_token]
-            label_ids += [pad_token_label_id]
-            if sep_token_extra:
-                # roberta uses an extra separator b/w pairs of sentences
-                tokens += [sep_token]
-                label_ids += [pad_token_label_id]
-            segment_ids = [sequence_a_segment_id] * len(tokens)
-
-            if cls_token_at_end:
-                tokens += [cls_token]
-                label_ids += [pad_token_label_id]
-                segment_ids += [cls_token_segment_id]
-            else:
-                tokens = [cls_token] + tokens
-                label_ids = [pad_token_label_id] + label_ids
-                segment_ids = [cls_token_segment_id] + segment_ids
-
-            input_ids = tokenizer.convert_tokens_to_ids(tokens)
-
-            # The mask has 1 for real tokens and 0 for padding tokens. Only real
-            # tokens are attended to.
-            input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
-
-            # Zero-pad up to the sequence length.
-            padding_length = max_seq_length - len(input_ids)
-            if pad_on_left:
-                input_ids = ([pad_token] * padding_length) + input_ids
-                input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
-                segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
-                label_ids = ([pad_token_label_id] * padding_length) + label_ids
-            else:
-                input_ids += [pad_token] * padding_length
-                input_mask += [0 if mask_padding_with_zero else 1] * padding_length
-                segment_ids += [pad_token_segment_id] * padding_length
-                label_ids += [pad_token_label_id] * padding_length
-
-            assert len(input_ids) == max_seq_length
-            assert len(input_mask) == max_seq_length
-            assert len(segment_ids) == max_seq_length
-            assert len(label_ids) == max_seq_length
-
-            if ex_index < 5:
-                logger.info("*** Example ***")
-                logger.info("guid: %s", example.guid)
-                logger.info("tokens: %s", " ".join([str(x) for x in tokens]))
-                logger.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
-                logger.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
-                logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
-                logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))
-
-            if "token_type_ids" not in tokenizer.model_input_names:
-                segment_ids = None
-
-            features.append(
-                InputFeatures(
-                    input_ids=input_ids, attention_mask=input_mask, token_type_ids=segment_ids, label_ids=label_ids
-                )
-            )
-        return features
-
-
-if is_torch_available():
-    import torch
-    from torch import nn
-    from torch.utils.data.dataset import Dataset
-
-    class TokenClassificationDataset(Dataset):
-        """
-        This will be superseded by a framework-agnostic approach
-        soon.
-        """
-
-        features: List[InputFeatures]
-        pad_token_label_id: int = nn.CrossEntropyLoss().ignore_index
-        # Use cross entropy ignore_index as padding label id so that only
-        # real label ids contribute to the loss later.
-
-        def __init__(
-            self,
-            token_classification_task: TokenClassificationTask,
-            data_dir: str,
-            tokenizer: PreTrainedTokenizer,
-            labels: List[str],
-            model_type: str,
-            max_seq_length: Optional[int] = None,
-            overwrite_cache=False,
-            mode: Split = Split.train,
-        ):
-            # Load data features from cache or dataset file
-            cached_features_file = os.path.join(
-                data_dir,
-                "cached_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length)),
-            )
-
-            # Make sure only the first process in distributed training processes the dataset,
-            # and the others will use the cache.
-            lock_path = cached_features_file + ".lock"
-            with FileLock(lock_path):
-
-                if os.path.exists(cached_features_file) and not overwrite_cache:
-                    logger.info(f"Loading features from cached file {cached_features_file}")
-                    self.features = torch.load(cached_features_file)
-                else:
-                    logger.info(f"Creating features from dataset file at {data_dir}")
-                    examples = token_classification_task.read_examples_from_file(data_dir, mode)
-                    # TODO clean up all this to leverage built-in features of tokenizers
-                    self.features = token_classification_task.convert_examples_to_features(
-                        examples,
-                        labels,
-                        max_seq_length,
-                        tokenizer,
-                        cls_token_at_end=bool(model_type in ["xlnet"]),
-                        # xlnet has a cls token at the end
-                        cls_token=tokenizer.cls_token,
-                        cls_token_segment_id=2 if model_type in ["xlnet"] else 0,
-                        sep_token=tokenizer.sep_token,
-                        sep_token_extra=False,
-                        # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
-                        pad_on_left=bool(tokenizer.padding_side == "left"),
-                        pad_token=tokenizer.pad_token_id,
-                        pad_token_segment_id=tokenizer.pad_token_type_id,
-                        pad_token_label_id=self.pad_token_label_id,
-                    )
-                    logger.info(f"Saving features into cached file {cached_features_file}")
-                    torch.save(self.features, cached_features_file)
-
-        def __len__(self):
-            return len(self.features)
-
-        def __getitem__(self, i) -> InputFeatures:
-            return self.features[i]
-
-
-if is_tf_available():
-    import tensorflow as tf
-
-    class TFTokenClassificationDataset:
-        """
-        This will be superseded by a framework-agnostic approach
-        soon.
-        """
-
-        features: List[InputFeatures]
-        pad_token_label_id: int = -100
-        # Use cross entropy ignore_index as padding label id so that only
-        # real label ids contribute to the loss later.
-
-        def __init__(
-            self,
-            token_classification_task: TokenClassificationTask,
-            data_dir: str,
-            tokenizer: PreTrainedTokenizer,
-            labels: List[str],
-            model_type: str,
-            max_seq_length: Optional[int] = None,
-            overwrite_cache=False,
-            mode: Split = Split.train,
-        ):
-            examples = token_classification_task.read_examples_from_file(data_dir, mode)
-            # TODO clean up all this to leverage built-in features of tokenizers
-            self.features = token_classification_task.convert_examples_to_features(
-                examples,
-                labels,
-                max_seq_length,
-                tokenizer,
-                cls_token_at_end=bool(model_type in ["xlnet"]),
-                # xlnet has a cls token at the end
-                cls_token=tokenizer.cls_token,
-                cls_token_segment_id=2 if model_type in ["xlnet"] else 0,
-                sep_token=tokenizer.sep_token,
-                sep_token_extra=False,
-                # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
-                pad_on_left=bool(tokenizer.padding_side == "left"),
-                pad_token=tokenizer.pad_token_id,
-                pad_token_segment_id=tokenizer.pad_token_type_id,
-                pad_token_label_id=self.pad_token_label_id,
-            )
-
-            def gen():
-                for ex in self.features:
-                    if ex.token_type_ids is None:
-                        yield (
-                            {"input_ids": ex.input_ids, "attention_mask": ex.attention_mask},
-                            ex.label_ids,
-                        )
-                    else:
-                        yield (
-                            {
-                                "input_ids": ex.input_ids,
-                                "attention_mask": ex.attention_mask,
-                                "token_type_ids": ex.token_type_ids,
-                            },
-                            ex.label_ids,
-                        )
-
-            if "token_type_ids" not in tokenizer.model_input_names:
-                self.dataset = tf.data.Dataset.from_generator(
-                    gen,
-                    ({"input_ids": tf.int32, "attention_mask": tf.int32}, tf.int64),
-                    (
-                        {"input_ids": tf.TensorShape([None]), "attention_mask": tf.TensorShape([None])},
-                        tf.TensorShape([None]),
-                    ),
-                )
-            else:
-                self.dataset = tf.data.Dataset.from_generator(
-                    gen,
-                    ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
-                    (
-                        {
-                            "input_ids": tf.TensorShape([None]),
-                            "attention_mask": tf.TensorShape([None]),
-                            "token_type_ids": tf.TensorShape([None]),
-                        },
-                        tf.TensorShape([None]),
-                    ),
-                )
-
-        def get_dataset(self):
-            self.dataset = self.dataset.apply(tf.data.experimental.assert_cardinality(len(self.features)))
-
-            return self.dataset
-
-        def __len__(self):
-            return len(self.features)
-
-        def __getitem__(self, i) -> InputFeatures:
-            return self.features[i]