is_pretokenized -> is_split_into_words (#7236)
* is_pretokenized -> is_split_into_words * Fix tests
This commit is contained in:
@@ -324,7 +324,7 @@ which we'll use in a moment:
|
||||
id2tag = {id: tag for tag, id in tag2id.items()}
|
||||
|
||||
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing
|
||||
with ready-split tokens rather than full sentence strings by passing ``is_pretokenized=True``. We'll also pass
|
||||
with ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
|
||||
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model
|
||||
to return information about the tokens which are split by the wordpiece tokenization process, which we will need in
|
||||
a moment.
|
||||
@@ -333,8 +333,8 @@ a moment.
|
||||
|
||||
from transformers import DistilBertTokenizerFast
|
||||
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
|
||||
train_encodings = tokenizer(train_texts, is_pretokenized=True, return_offsets_mapping=True, padding=True, truncation=True)
|
||||
val_encodings = tokenizer(val_texts, is_pretokenized=True, return_offsets_mapping=True, padding=True, truncation=True)
|
||||
train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
|
||||
val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
|
||||
|
||||
Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert
|
||||
model below.
|
||||
|
||||
@@ -290,12 +290,12 @@ predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Na
|
||||
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
|
||||
like BPE).
|
||||
|
||||
If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
|
||||
If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
|
||||
tokenizer. For instance, we have:
|
||||
|
||||
.. code-block::
|
||||
|
||||
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_pretokenized=True)
|
||||
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
|
||||
>>> print(encoded_input)
|
||||
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
||||
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
@@ -312,7 +312,7 @@ like this:
|
||||
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
|
||||
["And", "another", "sentence"],
|
||||
["And", "the", "very", "very", "last", "one"]]
|
||||
encoded_inputs = tokenizer(batch_sentences, is_pretokenized=True)
|
||||
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
|
||||
|
||||
or a batch of pair sentences like this:
|
||||
|
||||
@@ -321,7 +321,7 @@ or a batch of pair sentences like this:
|
||||
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
|
||||
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
|
||||
["And", "I", "go", "with", "the", "very", "last", "one"]]
|
||||
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_pretokenized=True)
|
||||
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
|
||||
|
||||
And you can add padding, truncation as well as directly return tensors like before:
|
||||
|
||||
@@ -330,14 +330,14 @@ And you can add padding, truncation as well as directly return tensors like befo
|
||||
## PYTORCH CODE
|
||||
batch = tokenizer(batch_sentences,
|
||||
batch_of_second_sentences,
|
||||
is_pretokenized=True,
|
||||
is_split_into_words=True,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
return_tensors="pt")
|
||||
## TENSORFLOW CODE
|
||||
batch = tokenizer(batch_sentences,
|
||||
batch_of_second_sentences,
|
||||
is_pretokenized=True,
|
||||
is_split_into_words=True,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
return_tensors="tf")
|
||||
|
||||
Reference in New Issue
Block a user