From d24ea708d742263efe4f4b8d525402f2d916c96c Mon Sep 17 00:00:00 2001
From: Oren Amsalem
Date: Thu, 30 Jul 2020 13:13:29 +0300
Subject: [PATCH] Actually the extra_id are from 0-99 and not from 1-100
 (#5967)

a = tokenizer.encode("we got a <extra_id_99>", return_tensors='pt',add_special_tokens=True)
print(a)
>tensor([[   62,   530,     3,     9, 32000]])
a = tokenizer.encode("we got a <extra_id_100>", return_tensors='pt',add_special_tokens=True)
print(a)
>tensor([[   62,   530,     3,     9,     3,     2, 25666,   834,    23,    26,   834,  2915,  3155]])
---
 docs/source/model_doc/t5.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/model_doc/t5.rst b/docs/source/model_doc/t5.rst
index 2e7bd285f0..f7451300c8 100644
--- a/docs/source/model_doc/t5.rst
+++ b/docs/source/model_doc/t5.rst
@@ -38,13 +38,13 @@ T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
   In this setup spans of the input sequence are masked by so-called sentinel tokens
   (*a.k.a* unique mask tokens) and the output sequence is formed as a concatenation
   of the same sentinel tokens and the *real* masked tokens.
-  Each sentinel token represents a unique mask token for this sentence and should start with ``<extra_id_1>``, ``<extra_id_2>``, ... up to ``<extra_id_100>``. As a default 100 sentinel tokens are available in ``T5Tokenizer``.
+  Each sentinel token represents a unique mask token for this sentence and should start with ``<extra_id_0>``, ``<extra_id_1>``, ... up to ``<extra_id_99>``. As a default 100 sentinel tokens are available in ``T5Tokenizer``.
   *E.g.* the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:
 
 ::
 
-    input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
-    labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
+    input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
+    labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
     # the forward function automatically creates the correct decoder_input_ids
     model(input_ids=input_ids, labels=labels)
 
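
Note on the numbering fixed by this patch: the sketch below builds the masked input and target strings for the docs example with sentinels counted from 0. It is plain Python and does not call the tokenizer. `mask_spans` and its parameters are hypothetical names used only for illustration and are not part of the library. It also leaves out the trailing `</s>` that the docs example appends by hand.

```python
def mask_spans(words, spans, max_sentinels=100):
    """Replace word spans with T5-style sentinels, numbered from 0.

    `spans` holds (start, end) word-index pairs, end exclusive, sorted and
    non-overlapping. The target needs one sentinel more than the number of
    spans, because a final sentinel closes the last masked span.
    """
    if len(spans) + 1 > max_sentinels:
        raise ValueError("not enough sentinel tokens for this many spans")

    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(words[cursor:start])  # unmasked words before the span
        inputs.append(sentinel)             # the span collapses to one sentinel
        targets.append(sentinel)            # the same sentinel opens the span in the target
        targets.extend(words[start:end])    # followed by the words that were masked
        cursor = end
    inputs.extend(words[cursor:])           # unmasked tail of the sentence
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


words = "The cute dog walks in the park".split()
source, target = mask_spans(words, [(1, 3), (5, 6)])
print(source)  # The <extra_id_0> walks in <extra_id_1> park
print(target)  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```

With the default of 100 sentinels the highest usable index is 99, which matches the commit message: `<extra_id_99>` encodes to a single id, while `<extra_id_100>` is split into ordinary subword pieces.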