is_pretokenized -> is_split_into_words (#7236)

* is_pretokenized -> is_split_into_words
* Fix tests
2020-09-22 09:34:35 -04:00
parent 324f361e91
commit 21ca148090
9 changed files with 142 additions and 72 deletions
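Note for callers still passing the old keyword: a rename like this is usually paired with a small compatibility shim. The sketch below is illustrative only and is not taken from this commit. `resolve_split_flag` is a hypothetical helper name, and the warning category and message are assumptions. It shows one way an old `is_pretokenized` keyword could be mapped onto `is_split_into_words` while warning the caller.

```python
import warnings


def resolve_split_flag(is_split_into_words=False, **kwargs):
    """Hypothetical helper (not part of the library): map the old keyword onto the new one.

    Returns the effective flag and the remaining keyword arguments.
    """
    if "is_pretokenized" in kwargs:
        # Assumed wording and category; the real library may differ.
        warnings.warn(
            "`is_pretokenized` is deprecated; use `is_split_into_words` instead.",
            FutureWarning,
        )
        # The old keyword, when present, overrides the default of the new one.
        is_split_into_words = kwargs.pop("is_pretokenized")
    return is_split_into_words, kwargs


if __name__ == "__main__":
    flag, remaining = resolve_split_flag(is_pretokenized=True, padding=True)
    print(flag, remaining)
```

With a shim of this shape, existing call sites keep working during a transition period, and the warning points users at the new name.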
--- a/docs/source/preprocessing.rst
+++ b/docs/source/preprocessing.rst
@@ -290,12 +290,12 @@ predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Na
    if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
    like BPE).

-If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
+If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
 tokenizer. For instance, we have:

 .. code-block::

-    >>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_pretokenized=True)
+    >>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
    >>> print(encoded_input)
    {'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
@@ -312,7 +312,7 @@ like this:
    batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
                       ["And", "another", "sentence"],
                       ["And", "the", "very", "very", "last", "one"]]
-    encoded_inputs = tokenizer(batch_sentences, is_pretokenized=True)
+    encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)

 or a batch of pair sentences like this:

@@ -321,7 +321,7 @@ or a batch of pair sentences like this:
    batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
                                 ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
                                 ["And", "I", "go", "with", "the", "very", "last", "one"]]
-    encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_pretokenized=True)
+    encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)

 And you can add padding, truncation as well as directly return tensors like before:

@@ -330,14 +330,14 @@ And you can add padding, truncation as well as directly return tensors like befo
    ## PYTORCH CODE
    batch = tokenizer(batch_sentences,
                      batch_of_second_sentences,
-                      is_pretokenized=True,
+                      is_split_into_words=True,
                      padding=True,
                      truncation=True,
                      return_tensors="pt")
    ## TENSORFLOW CODE
    batch = tokenizer(batch_sentences,
                      batch_of_second_sentences,
-                      is_pretokenized=True,
+                      is_split_into_words=True,
                      padding=True,
                      truncation=True,
                      return_tensors="tf")