is_pretokenized -> is_split_into_words (#7236)

* is_pretokenized -> is_split_into_words
* Fix tests
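
For downstream code that still builds tokenizer arguments with the old keyword, a small shim along these lines may ease migration. This is only a sketch: `rename_pretokenized_kwarg` is a hypothetical helper, not part of transformers, and it is not meant to describe how the library itself handles the rename.

```python
import warnings


def rename_pretokenized_kwarg(kwargs):
    """Return a copy of kwargs with is_pretokenized renamed (hypothetical helper)."""
    updated = dict(kwargs)  # copy so the caller's dict is left untouched
    if "is_pretokenized" in updated:
        if "is_split_into_words" in updated:
            # Refuse ambiguous input rather than silently picking one value.
            raise TypeError("pass only is_split_into_words, not both keywords")
        warnings.warn(
            "is_pretokenized is deprecated; use is_split_into_words instead",
            FutureWarning,
        )
        updated["is_split_into_words"] = updated.pop("is_pretokenized")
    return updated


# Arguments written against the old keyword name.
old_call = {"is_pretokenized": True, "padding": True, "truncation": True}
new_call = rename_pretokenized_kwarg(old_call)
```

The converted dictionary could then be unpacked into a tokenizer call, for example `tokenizer(train_texts, **new_call)`. Raising on conflicting keywords is a design choice for this sketch; a caller might prefer to let the new keyword win instead.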
2020-09-22 09:34:35 -04:00
parent 324f361e91
commit 21ca148090
9 changed files with 142 additions and 72 deletions
--- a/docs/source/custom_datasets.rst
+++ b/docs/source/custom_datasets.rst
@@ -324,7 +324,7 @@ which we'll use in a moment:
    id2tag = {id: tag for tag, id in tag2id.items()}

 To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing
-with ready-split tokens rather than full sentence strings by passing ``is_pretokenized=True``. We'll also pass
+with ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
 ``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model
 to return information about the tokens which are split by the wordpiece tokenization process, which we will need in
 a moment.
@@ -333,8 +333,8 @@ a moment.

    from transformers import DistilBertTokenizerFast
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
-    train_encodings = tokenizer(train_texts, is_pretokenized=True, return_offsets_mapping=True, padding=True, truncation=True)
-    val_encodings = tokenizer(val_texts, is_pretokenized=True, return_offsets_mapping=True, padding=True, truncation=True)
+    train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
+    val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)

 Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert
 model below.