Refactor Code samples; Test code samples (#5036)
* Refactor code samples
* Test docstrings
* Style
* Tokenization examples
* Run rust of tests
* First step to testing source docs
* Style and BART comment
* Test the remainder of the code samples
* Style
* let to const
* Formatting fixes
* Ready for merge
* Fix fixture + Style
* Fix last tests
* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing @sgugger's comments + Fix MobileBERT in TF

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -45,17 +45,16 @@ tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ token

 ::

-    from transformers import BertTokenizer
-    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
+    >>> from transformers import BertTokenizer
+    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

-    sequence = "A Titan RTX has 24GB of VRAM"
+    >>> sequence = "A Titan RTX has 24GB of VRAM"

 The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.

 ::

-    tokenized_sequence = tokenizer.tokenize(sequence)
-    print(tokenized_sequence)
+    >>> tokenized_sequence = tokenizer.tokenize(sequence)

 The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
 into "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is
@@ -63,6 +62,7 @@ added for "RA" and "M":

 ::

+    >>> print(tokenized_sequence)
     ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
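
The split shown above can be approximated with a greedy longest-match-first search over a vocabulary. The following is only a minimal sketch under stated assumptions: ``wordpiece_split`` is a hypothetical helper, and ``toy_vocab`` is a tiny made-up vocabulary, not the real ``bert-base-cased`` one. The actual tokenizer also handles normalization and punctuation, which this sketch ignores.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: give up on the whole word.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption), just large enough for the words below.
toy_vocab = {"V", "##RA", "##M", "R", "##T", "##X", "has"}

print(wordpiece_split("VRAM", toy_vocab))
print(wordpiece_split("RTX", toy_vocab))
print(wordpiece_split("has", toy_vocab))
```

With this toy vocabulary, "VRAM" and "RTX" come out as a first piece followed by ``##``-prefixed continuation pieces, while "has" stays whole because it is in the vocabulary as-is.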

 These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
@@ -71,14 +71,14 @@ the sentence to the tokenizer, which leverages the Rust implementation of

 ::

-    encoded_sequence = tokenizer(sequence)["input_ids"]
-    print(encoded_sequence)
+    >>> encoded_sequence = tokenizer(sequence)["input_ids"]

 The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The
 token indices are under the key "input_ids":

 ::

+    >>> print(encoded_sequence)
     [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
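
Conceptually, the IDs above are a vocabulary lookup wrapped in two special tokens. The sketch below assumes a toy mapping whose token/ID pairs are inferred by aligning the token list and the ID list shown in this document position by position (with 101 and 102 taken to be ``[CLS]`` and ``[SEP]``); ``encode_tokens`` is a hypothetical helper, not a library function.

```python
# Toy vocabulary (an assumption): pairs inferred by aligning the two outputs above.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "A": 138, "Titan": 18696, "R": 155, "##T": 1942, "##X": 3190,
    "has": 1144, "24": 1572, "##GB": 13745, "of": 1104,
    "V": 159, "##RA": 9664, "##M": 2107,
}


def encode_tokens(tokens, vocab):
    """Look up each token's ID and wrap the result in [CLS] ... [SEP]."""
    body = [vocab[token] for token in tokens]
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]


tokens = ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
encoded = encode_tokens(tokens, toy_vocab)
print(encoded)
```

The result has two more entries than the token list, one for each special token added at the ends.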

 Note that the tokenizer automatically adds "special tokens" (if the associated model relies on them) which are special
@@ -86,13 +86,14 @@ IDs the model sometimes uses. If we decode the previous sequence of ids,

 ::

-    tokenizer.decode(encoded_sequence)
+    >>> decoded_sequence = tokenizer.decode(encoded_sequence)

 we will see

 ::

-    '[CLS] A Titan RTX has 24GB of VRAM [SEP]'
+    >>> print(decoded_sequence)
+    [CLS] A Titan RTX has 24GB of VRAM [SEP]

 because this is the way a :class:`~transformers.BertModel` is going to expect its inputs.

@@ -108,21 +109,20 @@ For example, consider these two sequences:

 ::

-    from transformers import BertTokenizer
-    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
+    >>> from transformers import BertTokenizer
+    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

-    sequence_a = "This is a short sequence."
-    sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
+    >>> sequence_a = "This is a short sequence."
+    >>> sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

-    encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
-    encoded_sequence_b = tokenizer(sequence_b)["input_ids"]
-
-    len(encoded_sequence_a), len(encoded_sequence_b)
+    >>> encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
+    >>> encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

 The encoded versions have different lengths:

 ::

+    >>> len(encoded_sequence_a), len(encoded_sequence_b)
     (8, 19)

 Therefore, we can't put them together in the same tensor as-is. The first sequence needs to be padded up to the length
@@ -133,15 +133,14 @@ it to pad like this:

 ::

-    padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
-    padded_sequences["input_ids"]
+    >>> padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

 We can see that 0s have been added on the right of the first sentence to make it the same length as the second one:

 ::

-    [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
-     [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
+    >>> padded_sequences["input_ids"]
+    [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

 This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating
 the position of the padded indices so that the model does not attend to them. For the
@@ -150,14 +149,8 @@ a padded value. This attention mask is in the dictionary returned by the tokeniz

 ::

-    padded_sequences["attention_mask"]
-
-will give back
-
-::
-
-    [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
-     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
+    >>> padded_sequences["attention_mask"]
+    [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
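
The relationship between padding and the attention mask can be reproduced without the library. This is a minimal sketch, assuming right-side padding with a pad ID of 0 (as in the output above); ``pad_batch`` is a hypothetical helper, and the two ID lists are copied from the unpadded part of the outputs shown in this document.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every ID list to the longest length and build the matching mask.

    Hypothetical helper for illustration; pad_id=0 is an assumption taken from
    the padded output shown above.
    """
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks a padded position the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


ids_a = [101, 1188, 1110, 170, 1603, 4954, 119, 102]
ids_b = [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110,
         1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

padded = pad_batch([ids_a, ids_b])
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

Both rows end up with 19 entries; the first row gains 11 pad IDs and its mask has 11 trailing zeros in the same positions.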

 .. _token-type-ids:

@@ -170,26 +163,27 @@ tokens. For example, the BERT model builds its two sequence input as such:

 ::

-    # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
+    >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

 We can use our tokenizer to automatically generate such a sentence by passing the two sequences as two arguments (and
 not a list like before) like this:

 ::

-    from transformers import BertTokenizer
-    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
-    sequence_a = "HuggingFace is based in NYC"
-    sequence_b = "Where is HuggingFace based?"
+    >>> from transformers import BertTokenizer
+    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
+    >>> sequence_a = "HuggingFace is based in NYC"
+    >>> sequence_b = "Where is HuggingFace based?"

-    encoded_dict = tokenizer(sequence_a, sequence_b)
-    tokenizer.decode(encoded_dict["input_ids"])
+    >>> encoded_dict = tokenizer(sequence_a, sequence_b)
+    >>> decoded = tokenizer.decode(encoded_dict["input_ids"])

 which will return:

 ::

-    "[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]"
+    >>> print(decoded)
+    [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]

 This is enough for some models to understand where one sequence ends and where another begins. However, other models
 such as BERT have an additional mechanism, which are the token type IDs (also called segment IDs). They are a binary
@@ -199,12 +193,7 @@ The tokenizer returns in the dictionary under the key "token_type_ids":

 ::

-    encoded_dict['token_type_ids']
-
-will return
-
-::
-
+    >>> encoded_dict['token_type_ids']
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
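
The 0/1 pattern above follows directly from the ``[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]`` layout. The sketch below is illustrative only: ``build_pair_inputs`` is a hypothetical helper, the special-token IDs 101 and 102 are taken from the earlier encoded output, and the sequence IDs are placeholders (eight per sequence, chosen so the result has the same 10/9 split as the output above), not real vocabulary IDs.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two ID lists BERT-style and mark which segment each position belongs to.

    Hypothetical helper for illustration only.
    """
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


# Placeholder IDs (an assumption), not the real IDs of the example sentences.
placeholder_a = list(range(1000, 1008))
placeholder_b = list(range(2000, 2008))

input_ids, token_type_ids = build_pair_inputs(placeholder_a, placeholder_b)
print(token_type_ids)
```

Each position in ``input_ids`` gets exactly one segment ID, so the two lists always have the same length.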

 The first sequence, the "context" used for the question, has all its tokens represented by :obj:`0`, whereas the