is_pretokenized -> is_split_into_words (#7236)
* is_pretokenized -> is_split_into_words * Fix tests
This commit is contained in:
@@ -290,12 +290,12 @@ predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Na
|
||||
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
|
||||
like BPE).
|
||||
|
||||
If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
|
||||
If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
|
||||
tokenizer. For instance, we have:
|
||||
|
||||
.. code-block::
|
||||
|
||||
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_pretokenized=True)
|
||||
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
|
||||
>>> print(encoded_input)
|
||||
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
||||
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
@@ -312,7 +312,7 @@ like this:
|
||||
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
|
||||
["And", "another", "sentence"],
|
||||
["And", "the", "very", "very", "last", "one"]]
|
||||
encoded_inputs = tokenizer(batch_sentences, is_pretokenized=True)
|
||||
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
|
||||
|
||||
or a batch of pair sentences like this:
|
||||
|
||||
@@ -321,7 +321,7 @@ or a batch of pair sentences like this:
|
||||
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
|
||||
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
|
||||
["And", "I", "go", "with", "the", "very", "last", "one"]]
|
||||
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_pretokenized=True)
|
||||
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
|
||||
|
||||
And you can add padding, truncation as well as directly return tensors like before:
|
||||
|
||||
@@ -330,14 +330,14 @@ And you can add padding, truncation as well as directly return tensors like befo
|
||||
## PYTORCH CODE
|
||||
batch = tokenizer(batch_sentences,
|
||||
batch_of_second_sentences,
|
||||
is_pretokenized=True,
|
||||
is_split_into_words=True,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
return_tensors="pt")
|
||||
## TENSORFLOW CODE
|
||||
batch = tokenizer(batch_sentences,
|
||||
batch_of_second_sentences,
|
||||
is_pretokenized=True,
|
||||
is_split_into_words=True,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
return_tensors="tf")
|
||||
|
||||
Reference in New Issue
Block a user