Minor typo fixes to the preprocessing tutorial in the docs (#8046)
* Fix minor typos

  Fix minor typos in the docs.

* Update docs/source/preprocessing.rst

  Clearer data structure description.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -51,7 +51,7 @@ The tokenizer can decode a list of token ids in a proper sentence:
 >>> tokenizer.decode(encoded_input["input_ids"])
 "[CLS] Hello, I'm a single sentence! [SEP]"
 
-As you can see, the tokenizer automatically added some special tokens that the model expect. Not all model need special
+As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need special
 tokens; for instance, if we had used` gtp2-medium` instead of `bert-base-cased` to create our tokenizer, we would have
 seen the same sentence as the original one here. You can disable this behavior (which is only advised if you have added
 those special tokens yourself) by passing ``add_special_tokens=False``.
@@ -76,7 +76,7 @@ tokenizer:
 [1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1]]}
 
-We get back a dictionary once again, this time with values being list of list of ints.
+We get back a dictionary once again, this time with values being lists of lists of ints.
 
 If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
 probably want:
@@ -114,7 +114,7 @@ You can do all of this by using the following options when feeding your list of
 [1, 1, 1, 1, 1, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
 
-It returns a dictionary string to tensor. We can now see what the `attention_mask <glossary.html#attention-mask>`__ is
+It returns a dictionary with string keys and tensor values. We can now see what the `attention_mask <glossary.html#attention-mask>`__ is
 all about: it points out which tokens the model should pay attention to and which ones it should not (because they
 represent padding in this case).
 
@@ -127,7 +127,7 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer
 Preprocessing pairs of sentences
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Sometimes you need to feed pair of sentences to your model. For instance, if you want to classify if two sentences in a
+Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in a
 pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is
 then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
 
@@ -179,7 +179,7 @@ list of first sentences and the list of second sentences:
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
 
-As we can see, it returns a dictionary with the values being list of lists of ints.
+As we can see, it returns a dictionary where each value is a list of lists of ints.
 
 To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
 
@@ -286,7 +286,7 @@ predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Na
 
 .. warning::
 
-Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them though the tokenizer
+Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
 if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
 like BPE).
 