is_pretokenized -> is_split_into_words (#7236)

* is_pretokenized -> is_split_into_words

* Fix tests
Sylvain Gugger
2020-09-22 09:34:35 -04:00
committed by GitHub
parent 324f361e91
commit 21ca148090
9 changed files with 142 additions and 72 deletions

@@ -290,12 +290,12 @@ predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Na
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).
-If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
+If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
tokenizer. For instance, we have:
.. code-block::
->>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_pretokenized=True)
+>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
@@ -312,7 +312,7 @@ like this:
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
["And", "another", "sentence"],
["And", "the", "very", "very", "last", "one"]]
-encoded_inputs = tokenizer(batch_sentences, is_pretokenized=True)
+encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
or a batch of pair sentences like this:
@@ -321,7 +321,7 @@ or a batch of pair sentences like this:
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
["And", "I", "go", "with", "the", "very", "last", "one"]]
-encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_pretokenized=True)
+encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
And you can add padding, truncation as well as directly return tensors like before:
@@ -330,14 +330,14 @@ And you can add padding, truncation as well as directly return tensors like befo
## PYTORCH CODE
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
-is_pretokenized=True,
+is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="pt")
## TENSORFLOW CODE
batch = tokenizer(batch_sentences,
batch_of_second_sentences,
-is_pretokenized=True,
+is_split_into_words=True,
padding=True,
truncation=True,
return_tensors="tf")
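
A keyword rename like `is_pretokenized` -> `is_split_into_words` is often paired with a deprecation shim so that callers using the old name keep working for a while. The sketch below is an illustration only and is not taken from this diff: `resolve_split_flag` is a hypothetical helper, and nothing here claims to show how the library itself handles the old argument.

```python
import warnings


def resolve_split_flag(is_split_into_words=False, **kwargs):
    """Return the effective flag, accepting the old keyword name as an alias.

    Hypothetical helper for illustration; not part of any library API.
    """
    if "is_pretokenized" in kwargs:
        # The old name still works, but the caller is told to migrate.
        warnings.warn(
            "`is_pretokenized` is deprecated; use `is_split_into_words` instead.",
            FutureWarning,
        )
        is_split_into_words = kwargs.pop("is_pretokenized")
    if kwargs:
        # Anything left over is a genuinely unknown keyword.
        raise TypeError(f"Unexpected keyword arguments: {sorted(kwargs)}")
    return is_split_into_words
```

Under these assumptions, `resolve_split_flag(is_pretokenized=True)` returns the same value as `resolve_split_flag(is_split_into_words=True)` and additionally emits a `FutureWarning`.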