is_pretokenized -> is_split_into_words (#7236)
* is_pretokenized -> is_split_into_words * Fix tests
This commit is contained in:
@@ -324,7 +324,7 @@ which we'll use in a moment:
|
||||
id2tag = {id: tag for tag, id in tag2id.items()}
|
||||
|
||||
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing
|
||||
with ready-split tokens rather than full sentence strings by passing ``is_pretokenized=True``. We'll also pass
|
||||
with ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
|
||||
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model
|
||||
to return information about the tokens which are split by the wordpiece tokenization process, which we will need in
|
||||
a moment.
|
||||
@@ -333,8 +333,8 @@ a moment.
|
||||
|
||||
from transformers import DistilBertTokenizerFast
|
||||
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
|
||||
train_encodings = tokenizer(train_texts, is_pretokenized=True, return_offsets_mapping=True, padding=True, truncation=True)
|
||||
val_encodings = tokenizer(val_texts, is_pretokenized=True, return_offsets_mapping=True, padding=True, truncation=True)
|
||||
train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
|
||||
val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
|
||||
|
||||
Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert
|
||||
model below.
|
||||
|
||||
Reference in New Issue
Block a user