From e79a0faeae808083340174c1824e6ad4d666222f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maciej=20Paw=C5=82owski?= <56093010+Pawloch247@users.noreply.github.com>
Date: Tue, 25 Jan 2022 23:26:17 +0100
Subject: [PATCH] Added missing code in exemplary notebook - custom datasets fine-tuning (#15300)

* Added missing code in exemplary notebook - custom datasets fine-tuning

Added missing code in tokenize_and_align_labels function in the exemplary notebook on custom datasets - token classification.
The missing code concerns adding labels for all but first token in a single word.
The added code was taken directly from huggingface official example - this [colab notebook](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb).

* Changes requested in the review - keep the code as simple as possible
---
 docs/source/custom_datasets.mdx | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/source/custom_datasets.mdx b/docs/source/custom_datasets.mdx
index 5dd1801f38..5fb5af8068 100644
--- a/docs/source/custom_datasets.mdx
+++ b/docs/source/custom_datasets.mdx
@@ -326,7 +326,9 @@ def tokenize_and_align_labels(examples):
                 label_ids.append(-100)
             elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                 label_ids.append(label[word_idx])
-
+            else:
+                label_ids.append(-100)
+
             previous_word_idx = word_idx
         labels.append(label_ids)
     tokenized_inputs["labels"] = labels
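
The effect of the added `else` branch can be checked in isolation. The following is a minimal sketch, not the notebook's code: `align_labels`, `IGNORE_INDEX`, and the sample `word_ids` and `word_labels` values are hypothetical names and data chosen for illustration, and the word-index list is written by hand instead of coming from a tokenizer. It mirrors the inner loop of `tokenize_and_align_labels` as patched above, so every token receives exactly one label entry.

```python
IGNORE_INDEX = -100  # value the patch appends for tokens that should carry no label


def align_labels(word_ids, word_labels):
    """Return one label per token, labelling only the first token of each word.

    Hypothetical helper that mirrors the inner loop of the patched function.
    """
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens map to no source word.
            label_ids.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx:
            # First token of a word carries that word's label.
            label_ids.append(word_labels[word_idx])
        else:
            # Remaining sub-tokens of the same word: the branch added by the patch.
            label_ids.append(IGNORE_INDEX)
        previous_word_idx = word_idx
    return label_ids


# Assumed sample input: three words, the second split into two sub-tokens,
# wrapped in two special tokens (None).
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Without the `else` branch, the repeated index `1` would append nothing, leaving the label list one entry shorter than the token list for every continuation sub-token.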