From 946400fb68ef743c52777665b7ef33a60aab5150 Mon Sep 17 00:00:00 2001
From: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Date: Fri, 19 Mar 2021 10:06:08 -0400
Subject: [PATCH] Expand a bit the presentation of examples (#10799)

* Expand a bit the presentation of examples

* Apply suggestions from code review

Co-authored-by: Stas Bekman

* Address review comments

Co-authored-by: Stas Bekman
---
 examples/README.md                      | 9 +++++++--
 examples/language-modeling/README.md    | 2 +-
 examples/multiple-choice/README.md      | 4 +++-
 examples/question-answering/README.md   | 5 +++++
 examples/seq2seq/README.md              | 2 +-
 examples/token-classification/README.md | 2 +-
 6 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/examples/README.md b/examples/README.md
index 1b2422f76d..4e2e4afc45 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -15,8 +15,13 @@ limitations under the License.
 
 # Examples
 
-This folder contains actively maintained examples of use of 🤗 Transformers organized along NLP tasks. If you are looking for an example that used to
-be in this folder, it may have moved to our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects).
+This folder contains actively maintained examples of use of 🤗 Transformers organized along NLP tasks. If you are looking for an example that used to be in this folder, it may have moved to our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects) or to the [legacy](https://github.com/huggingface/transformers/tree/master/examples/legacy) subfolder.
+
+While we strive to present as many use cases as possible, the scripts in this folder are just examples.
It is expected that they won't work out of the box on your specific problem and that you will need to change a few lines of code to adapt them to your needs. To help you with that, all the PyTorch versions of the examples fully expose the preprocessing of the data, so you can easily tweak them.
+
+The same applies if you want the scripts to report a different metric from the one they currently use: look at the `compute_metrics` function inside the script. It takes the full arrays of predictions and labels and has to return a dictionary of string keys and float values. Just change it to add your own metric to the ones already reported (or to replace them).
+
+Please discuss any feature you would like to implement in an example on the [forum](https://discuss.huggingface.co/) or in an [issue](https://github.com/huggingface/transformers/issues) before submitting a PR: we welcome bug fixes, but since we want to keep the examples as simple as possible, it's unlikely we will merge a pull request that adds functionality at the cost of readability.
 
 ## Important note
 
diff --git a/examples/language-modeling/README.md b/examples/language-modeling/README.md
index 6d913bbfa2..d2499651cd 100644
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -27,7 +27,7 @@ need extra processing on your datasets.
 
 **Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py).
 
-The following examples, will run on a datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
+The following examples will run on datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
 text files for training and validation. We give examples of both below.
 
 ### GPT-2/GPT and causal language modeling
diff --git a/examples/multiple-choice/README.md b/examples/multiple-choice/README.md
index 34d1dfee13..22b0c59f1b 100644
--- a/examples/multiple-choice/README.md
+++ b/examples/multiple-choice/README.md
@@ -18,7 +18,9 @@ limitations under the License.
 
 Based on the script [`run_swag.py`]().
 
-#### Fine-tuning on SWAG
+## PyTorch script: fine-tuning on SWAG
+
+`run_swag` allows you to fine-tune any model from our [hub](https://huggingface.co/models) (as long as its architecture has a `ForMultipleChoice` version in the library) on the SWAG dataset or on your own csv/jsonlines files, as long as they are structured the same way. To make it work on another dataset, you will need to tweak the `preprocess_function` inside the script.
 
 ```bash
 python examples/multiple-choice/run_swag.py \
diff --git a/examples/question-answering/README.md b/examples/question-answering/README.md
index 3d222c8365..71799e8e22 100644
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
@@ -24,6 +24,11 @@ uses special features of those tokenizers. You can check if your favorite model
 of the script.
 
 The old version of this script can be found [here](https://github.com/huggingface/transformers/tree/master/examples/legacy/question-answering).
+
+`run_qa.py` allows you to fine-tune any model from our [hub](https://huggingface.co/models) (as long as its architecture has a `ForQuestionAnswering` version in the library) on the SQuAD dataset, on another question-answering dataset of the `datasets` library, or on your own csv/jsonlines files, as long as they are structured the same way as SQuAD. You might need to tweak the data processing inside the script if your data is structured differently.
+
+Note that if your dataset contains samples with no possible answers (like SQuAD version 2), you need to pass along the flag `--version_2_with_negative`.
+
 #### Fine-tuning BERT on SQuAD1.0
 
 This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
diff --git a/examples/seq2seq/README.md b/examples/seq2seq/README.md
index b8dbe7b903..a79738f3ee 100644
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -114,7 +114,7 @@ and you wanted to select only `text` and `summary`, then you'd pass these additi
     --summary_column summary \
 ```
 
-#### Custom JSONFILES Files
+#### Custom JSONLINES Files
 
 The second supported format is jsonlines. Here is an example of a jsonlines custom data file.
 
diff --git a/examples/token-classification/README.md b/examples/token-classification/README.md
index e2d11e39c4..a556052f64 100644
--- a/examples/token-classification/README.md
+++ b/examples/token-classification/README.md
@@ -21,7 +21,7 @@ tagging (POS). The main scrip `run_ner.py` leverages the 🤗 Datasets library a
 customize it to your needs if you need extra processing on your datasets.
 
 It will either run on a datasets hosted on our [hub](https://huggingface.co/datasets) or with your own text files for
-training and validation.
+training and validation; you might just need to add some tweaks in the data preprocessing.
 
 The following example fine-tunes BERT on CoNLL-2003:
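
Note on the `examples/README.md` hunk: the `compute_metrics` contract it describes (full arrays of predictions and labels in, a dictionary of string keys and float values out) could look roughly like the sketch below. This is an illustration only and not code from any of the example scripts. It assumes a classification task where the predictions are logits of shape `(num_examples, num_classes)` and the argument unpacks into `(predictions, labels)`; the `error_rate` key is a hypothetical extra metric added to show how a second value would be reported.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return metrics as a dictionary of string keys and float values."""
    # Assumption: eval_pred unpacks into (predictions, labels), both NumPy arrays.
    predictions, labels = eval_pred
    # Assumption: predictions are logits, so the predicted class is the argmax.
    predicted_classes = np.argmax(predictions, axis=-1)
    accuracy = float((predicted_classes == labels).mean())
    # Hypothetical extra metric, reported alongside the existing one.
    return {"accuracy": accuracy, "error_rate": 1.0 - accuracy}


if __name__ == "__main__":
    # Tiny made-up batch: 4 examples, 2 classes.
    logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
    labels = np.array([1, 0, 0, 0])
    print(compute_metrics((logits, labels)))
```

Returning plain Python floats (rather than NumPy scalars) keeps the dictionary easy to log or serialize, which matches the "string keys and float values" wording in the README text.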