From c0d97cee139cea9e40b967e18eb6a807b4363452 Mon Sep 17 00:00:00 2001
From: Lysandre Debut
Date: Wed, 7 Apr 2021 10:06:45 -0400
Subject: [PATCH] =?UTF-8?q?Adds=20a=20note=20to=20resize=20the=20token=20e?=
 =?UTF-8?q?mbedding=20matrix=20when=20adding=20special=20=E2=80=A6=20(#111?=
 =?UTF-8?q?20)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Adds a note to resize the token embedding matrix when adding special tokens

* Remove superfluous space
---
 src/transformers/tokenization_utils_base.py | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/transformers/tokenization_utils_base.py b/src/transformers/tokenization_utils_base.py
index 6ccf3f48f7..7b68164b91 100644
--- a/src/transformers/tokenization_utils_base.py
+++ b/src/transformers/tokenization_utils_base.py
@@ -825,7 +825,13 @@ class SpecialTokensMixin:
         special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the
         current vocabulary).
 
-        Using : obj:`add_special_tokens` will ensure your special tokens can be used in several ways:
+        .. Note::
+            When adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of
+            the model so that its embedding matrix matches the tokenizer.
+
+            In order to do that, please use the :meth:`~transformers.PreTrainedModel.resize_token_embeddings` method.
+
+        Using :obj:`add_special_tokens` will ensure your special tokens can be used in several ways:
 
         - Special tokens are carefully handled by the tokenizer (they are never split).
         - You can easily refer to special tokens using tokenizer class attributes like :obj:`tokenizer.cls_token`. This