From e79a0faeae808083340174c1824e6ad4d666222f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maciej=20Paw=C5=82owski?=
 <56093010+Pawloch247@users.noreply.github.com>
Date: Tue, 25 Jan 2022 23:26:17 +0100
Subject: [PATCH] Added missing code in example notebook - custom datasets
 fine-tuning (#15300)

* Added missing code in example notebook - custom datasets fine-tuning

Added the missing code to the tokenize_and_align_labels function in the example notebook on custom datasets (token classification).
The missing code assigns the label -100 to every token of a word except the first, and updates previous_word_idx after each token so that the first-token check works.
The added code was taken directly from the official Hugging Face example, this [colab notebook](https://github.com/huggingface/notebooks/blob/master/transformers_doc/custom_datasets.ipynb).

* Changes requested in the review - keep the code as simple as possible
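
For reference, the alignment logic this patch completes can be sketched in isolation as below. This is a minimal illustration, not the notebook code itself: the helper name `align_labels`, the constant `IGNORE_INDEX`, and the sample `word_ids` / `word_labels` values are hypothetical. The `word_ids` list is assumed to have the shape returned by a fast tokenizer's `word_ids()` method, with `None` marking special tokens.

```python
IGNORE_INDEX = -100  # label value the patch uses for tokens that should not be scored


def align_labels(word_ids, word_labels):
    """Map word-level labels onto tokens, labelling only the first token of each word.

    Hypothetical helper mirroring the inner loop of tokenize_and_align_labels.
    """
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:  # special tokens get the ignore label
            label_ids.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx:  # first token of a new word
            label_ids.append(word_labels[word_idx])
        else:  # remaining sub-word tokens of the same word
            label_ids.append(IGNORE_INDEX)
        previous_word_idx = word_idx  # without this update the check above never sees a repeat
    return label_ids


# Made-up sample: three words, the second one split into two tokens.
sample_word_ids = [None, 0, 1, 1, 2, None]
sample_word_labels = [3, 0, 7]
print(align_labels(sample_word_ids, sample_word_labels))
# [-100, 3, 0, -100, 7, -100]
```

Without the `else` branch and the `previous_word_idx` update, every non-special token would take the first-token branch and receive its word's label, which contradicts the "only label the first token" comment in the snippet.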
---
 docs/source/custom_datasets.mdx | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/source/custom_datasets.mdx b/docs/source/custom_datasets.mdx
index 5dd1801f38..5fb5af8068 100644
--- a/docs/source/custom_datasets.mdx
+++ b/docs/source/custom_datasets.mdx
@@ -326,7 +326,9 @@ def tokenize_and_align_labels(examples):
                 label_ids.append(-100)
             elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                 label_ids.append(label[word_idx])
-
+            else:
+                label_ids.append(-100)
+            previous_word_idx = word_idx
         labels.append(label_ids)
 
     tokenized_inputs["labels"] = labels