Small docfile fixes (#6328)
@@ -40,12 +40,12 @@ There are many more parameters that can be configured via the benchmark argument
|
|||||||
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch) and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow).
|
``src/transformers/benchmark/benchmark_args_utils.py``, ``src/transformers/benchmark/benchmark_args.py`` (for PyTorch) and ``src/transformers/benchmark/benchmark_args_tf.py`` (for Tensorflow).
|
||||||
Alternatively, running the following shell commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow respectively.
|
Alternatively, running the following shell commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow respectively.
|
||||||
|
|
||||||
.. code-block::
|
.. code-block:: bash
|
||||||
|
|
||||||
>>> ## PYTORCH CODE
|
## PYTORCH CODE
|
||||||
python examples/benchmarking/run_benchmark.py --help
|
python examples/benchmarking/run_benchmark.py --help
|
||||||
|
|
||||||
>>> ## TENSORFLOW CODE
|
## TENSORFLOW CODE
|
||||||
python examples/benchmarking/run_benchmark_tf.py --help
|
python examples/benchmarking/run_benchmark_tf.py --help
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -20,7 +20,7 @@ work properly.
|
|||||||
To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
|
To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
|
||||||
:func:`~transformers.AutoTokenizer.from_pretrained` method:
|
:func:`~transformers.AutoTokenizer.from_pretrained` method:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
from transformers import AutoTokenizer
|
from transformers import AutoTokenizer
|
||||||
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
|
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
|
||||||
@@ -31,33 +31,24 @@ Base use
|
|||||||
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
|
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
|
||||||
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
|
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
encoded_input = tokenizer("Hello, I'm a single sentence!")
|
|
||||||
print(encoded_input)
|
|
||||||
|
|
||||||
This will return a dictionary string to list of ints like this one:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
|
>>> encoded_input = tokenizer("Hello, I'm a single sentence!")
|
||||||
|
>>> print(encoded_input)
|
||||||
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
|
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
|
||||||
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||||
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
|
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
|
||||||
|
|
||||||
|
This returns a dictionary string to list of ints.
|
||||||
The `input_ids <glossary.html#input-ids>`__ are the indices corresponding to each token in our sentence. We will see
|
The `input_ids <glossary.html#input-ids>`__ are the indices corresponding to each token in our sentence. We will see
|
||||||
below what the `attention_mask <glossary.html#attention-mask>`__ is used for and in
|
below what the `attention_mask <glossary.html#attention-mask>`__ is used for and in
|
||||||
:ref:`the next section <sentence-pairs>` the goal of `token_type_ids <glossary.html#token-type-ids>`__.
|
:ref:`the next section <sentence-pairs>` the goal of `token_type_ids <glossary.html#token-type-ids>`__.
|
||||||
|
|
||||||
The tokenizer can decode a list of token ids in a proper sentence:
|
The tokenizer can decode a list of token ids in a proper sentence:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
tokenizer.decode(encoded_input["input_ids"])
|
|
||||||
|
|
||||||
which should return
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
|
>>> tokenizer.decode(encoded_input["input_ids"])
|
||||||
"[CLS] Hello, I'm a single sentence! [SEP]"
|
"[CLS] Hello, I'm a single sentence! [SEP]"
|
||||||
|
|
||||||
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need special
|
As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need special
|
||||||
@@ -69,18 +60,13 @@ those special tokens yourself) by passing ``add_special_tokens=False``.
|
|||||||
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
|
If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
|
||||||
tokenizer:
|
tokenizer:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
batch_sentences = ["Hello I'm a single sentence",
|
|
||||||
"And another sentence",
|
|
||||||
"And the very very last one"]
|
|
||||||
encoded_inputs = tokenizer(batch_sentences)
|
|
||||||
print(encoded_inputs)
|
|
||||||
|
|
||||||
We get back a dictionary once again, this time with values being list of list of ints:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
|
>>> batch_sentences = ["Hello I'm a single sentence",
|
||||||
|
... "And another sentence",
|
||||||
|
... "And the very very last one"]
|
||||||
|
>>> encoded_inputs = tokenizer(batch_sentences)
|
||||||
|
>>> print(encoded_inputs)
|
||||||
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
||||||
[101, 1262, 1330, 5650, 102],
|
[101, 1262, 1330, 5650, 102],
|
||||||
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
|
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
|
||||||
@@ -91,6 +77,8 @@ We get back a dictionary once again, this time with values being list of list of
|
|||||||
[1, 1, 1, 1, 1],
|
[1, 1, 1, 1, 1],
|
||||||
[1, 1, 1, 1, 1, 1, 1, 1]]}
|
[1, 1, 1, 1, 1, 1, 1, 1]]}
|
||||||
|
|
||||||
|
We get back a dictionary once again, this time with values being list of list of ints.
|
||||||
|
|
||||||
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
|
If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
|
||||||
probably want:
|
probably want:
|
||||||
|
|
||||||
@@ -100,19 +88,11 @@ probably want:
|
|||||||
|
|
||||||
You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
|
You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
## PYTORCH CODE
|
|
||||||
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
|
|
||||||
print(batch)
|
|
||||||
## TENSORFLOW CODE
|
|
||||||
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
|
|
||||||
print(batch)
|
|
||||||
|
|
||||||
which should now return a dictionary string to tensor like this:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
|
>>> ## PYTORCH CODE
|
||||||
|
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
|
||||||
|
>>> print(batch)
|
||||||
{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
||||||
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
|
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
|
||||||
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
|
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
|
||||||
@@ -122,9 +102,22 @@ which should now return a dictionary string to tensor like this:
|
|||||||
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
|
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||||
[1, 1, 1, 1, 1, 0, 0, 0, 0],
|
[1, 1, 1, 1, 1, 0, 0, 0, 0],
|
||||||
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
|
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
|
||||||
|
>>> ## TENSORFLOW CODE
|
||||||
|
>>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
|
||||||
|
>>> print(batch)
|
||||||
|
{'input_ids': tf.Tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
||||||
|
[ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
|
||||||
|
[ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
|
||||||
|
'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||||
|
[0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||||
|
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
|
||||||
|
'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||||
|
[1, 1, 1, 1, 1, 0, 0, 0, 0],
|
||||||
|
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
|
||||||
|
|
||||||
We can now see what the `attention_mask <glossary.html#attention-mask>`__ is all about: it points out which tokens the
|
It returns a dictionary string to tensor. We can now see what the `attention_mask <glossary.html#attention-mask>`__ is
|
||||||
model should pay attention to and which ones it should not (because they represent padding in this case).
|
all about: it points out which tokens the model should pay attention to and which ones it should not (because they
|
||||||
|
represent padding in this case).
|
||||||
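As a rough illustration of the padding behaviour described in this hunk, the pure-Python sketch below pads a batch of id lists to the length of the longest one and derives the matching attention mask. The helper name `pad_batch` and the `pad_id=0` default are assumptions made for this example and are not part of the 🤗 Transformers API; the tokenizer does the equivalent work internally when called with ``padding=True``.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every list of ids to the length of the longest one.

    Hypothetical helper for illustration only.
    Returns a tuple (input_ids, attention_mask).
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens keep their ids; padding positions get pad_id.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Unpadded ids taken from the batch example in the diff above.
batch_ids = [
    [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
    [101, 1262, 1330, 5650, 102],
    [101, 1262, 1103, 1304, 1304, 1314, 1141, 102],
]
input_ids, attention_mask = pad_batch(batch_ids)
```

The second row of `attention_mask` ends in four zeros, which lines up with the four padding ids appended to the second sentence in the tensor output shown in the diff.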
|
|
||||||
|
|
||||||
Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
|
Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
|
||||||
@@ -137,26 +130,16 @@ Preprocessing pairs of sentences
|
|||||||
|
|
||||||
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in a
|
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in a
|
||||||
pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is
|
pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is
|
||||||
then represented like this:
|
then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
[CLS] Sequence A [SEP] Sequence B [SEP]
|
|
||||||
|
|
||||||
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
|
You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
|
||||||
|
|
||||||
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
|
(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
|
||||||
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
encoded_input = tokenizer("How old are you?", "I'm 6 years old")
|
|
||||||
print(encoded_input)
|
|
||||||
|
|
||||||
This will once again return a dict string to list of ints:
|
This will once again return a dict string to list of ints:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
|
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
|
||||||
|
>>> print(encoded_input)
|
||||||
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
|
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
|
||||||
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
|
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
|
||||||
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
|
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
|
||||||
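To make the relationship between the ``[CLS] Sequence A [SEP] Sequence B [SEP]`` layout and the ``token_type_ids`` above more concrete, here is a minimal pure-Python sketch. The function name `segment_ids` is hypothetical, and the separator id `102` is read off the example output in this diff (it is the id that decodes to ``[SEP]`` for ``bert-base-cased``); other models may use different ids or no token type ids at all.

```python
def segment_ids(input_ids, sep_id=102):
    """Return 0 for tokens up to and including the first separator, 1 afterwards.

    Hypothetical helper for illustration only; it mimics the BERT-style
    token_type_ids shown in the example output.
    """
    type_ids = []
    segment = 0
    for token_id in input_ids:
        type_ids.append(segment)
        # The first [SEP] closes sequence A; everything after it is sequence B.
        if token_id == sep_id and segment == 0:
            segment = 1
    return type_ids


# input_ids from the "How old are you?" / "I'm 6 years old" example above.
pair_ids = [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102]
token_type_ids = segment_ids(pair_ids)
```

The result has seven zeros followed by seven ones, the same split as the ``token_type_ids`` row in the example output.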
@@ -169,34 +152,24 @@ using ``return_input_ids`` or ``return_token_type_ids``.
|
|||||||
|
|
||||||
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
|
If we decode the token ids we obtained, we will see that the special tokens have been properly added.
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
tokenizer.decode(encoded_input["input_ids"])
|
|
||||||
|
|
||||||
will return:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
|
>>> tokenizer.decode(encoded_input["input_ids"])
|
||||||
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
|
"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
|
||||||
|
|
||||||
If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
|
If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
|
||||||
list of first sentences and the list of second sentences:
|
list of first sentences and the list of second sentences:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
batch_sentences = ["Hello I'm a single sentence",
|
|
||||||
"And another sentence",
|
|
||||||
"And the very very last one"]
|
|
||||||
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
|
|
||||||
"And I should be encoded with the second sentence",
|
|
||||||
"And I go with the very last one"]
|
|
||||||
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
|
|
||||||
print(encoded_inputs)
|
|
||||||
|
|
||||||
will return a dict with the values being list of lists of ints:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
|
>>> batch_sentences = ["Hello I'm a single sentence",
|
||||||
|
... "And another sentence",
|
||||||
|
... "And the very very last one"]
|
||||||
|
>>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
|
||||||
|
... "And I should be encoded with the second sentence",
|
||||||
|
... "And I go with the very last one"]
|
||||||
|
>>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
|
||||||
|
>>> print(encoded_inputs)
|
||||||
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
|
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
|
||||||
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
|
[101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
|
||||||
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
|
[101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
|
||||||
@@ -207,17 +180,14 @@ will return a dict with the values being list of lists of ints:
|
|||||||
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||||
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
|
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
|
||||||
|
|
||||||
|
As we can see, it returns a dictionary with the values being list of lists of ints.
|
||||||
|
|
||||||
To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
|
To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
for ids in encoded_inputs["input_ids"]:
|
|
||||||
print(tokenizer.decode(ids))
|
|
||||||
|
|
||||||
which will return:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
|
>>> for ids in encoded_inputs["input_ids"]:
|
||||||
|
...     print(tokenizer.decode(ids))
|
||||||
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
|
[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
|
||||||
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
|
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
|
||||||
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
|
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
|
||||||
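The snippets in this commit are being converted to interpreter-session style, where ``>>>`` starts a statement and ``...`` continues it. The sketch below, which uses only the standard-library ``doctest`` module, checks a small loop written in that style; the snippet text and the name `loop_example` are made up for illustration and stand in for the tokenizer call, which would need a downloaded model.

```python
import doctest

# A loop in doctest style: the body uses the "..." continuation prompt
# and is indented, followed by the expected printed output.
SNIPPET = '''
>>> for ids in [[101, 102], [101, 8667, 102]]:
...     print(len(ids))
2
3
'''

parser = doctest.DocTestParser()
test = parser.get_doctest(SNIPPET, globs={}, name="loop_example", filename=None, lineno=0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)  # TestResults(failed=..., attempted=...)
```

If the loop body were written with a second ``>>>`` prompt instead of ``...``, the parser would treat it as a separate statement and the ``for`` line would fail with a syntax error.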
@@ -225,7 +195,7 @@ which will return:
|
|||||||
Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
|
Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
|
||||||
length the model can accept and return tensors directly with the following:
|
length the model can accept and return tensors directly with the following:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
## PYTORCH CODE
|
## PYTORCH CODE
|
||||||
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
|
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
|
||||||
@@ -316,17 +286,12 @@ predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Na
|
|||||||
`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
|
`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
|
||||||
|
|
||||||
If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
|
If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
|
||||||
tokenizer. For instance:
|
tokenizer. For instance, we have:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_pretokenized=True)
|
|
||||||
print(encoded_input)
|
|
||||||
|
|
||||||
will return:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
|
>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_pretokenized=True)
|
||||||
|
>>> print(encoded_input)
|
||||||
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
|
||||||
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
|
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||||
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
|
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
|
||||||
@@ -337,7 +302,7 @@ Note that the tokenizer still adds the ids of special tokens (if applicable) unl
|
|||||||
This works exactly as before for batch of sentences or batch of pairs of sentences. You can encode a batch of sentences
|
This works exactly as before for batch of sentences or batch of pairs of sentences. You can encode a batch of sentences
|
||||||
like this:
|
like this:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
|
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
|
||||||
["And", "another", "sentence"],
|
["And", "another", "sentence"],
|
||||||
@@ -346,7 +311,7 @@ like this:
|
|||||||
|
|
||||||
or a batch of pair sentences like this:
|
or a batch of pair sentences like this:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
|
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
|
||||||
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
|
["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
|
||||||
@@ -355,7 +320,7 @@ or a batch of pair sentences like this:
|
|||||||
|
|
||||||
And you can add padding, truncation as well as directly return tensors like before:
|
And you can add padding, truncation as well as directly return tensors like before:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
## PYTORCH CODE
|
## PYTORCH CODE
|
||||||
batch = tokenizer(batch_sentences,
|
batch = tokenizer(batch_sentences,
|
||||||
|
|||||||
@@ -128,7 +128,7 @@ Under the hood: pretrained models
|
|||||||
Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
|
Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
|
||||||
using the :obj:`from_pretrained` method:
|
using the :obj:`from_pretrained` method:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
>>> ## PYTORCH CODE
|
>>> ## PYTORCH CODE
|
||||||
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
||||||
@@ -146,7 +146,7 @@ Using the tokenizer
|
|||||||
|
|
||||||
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
|
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
|
||||||
words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
|
words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
|
||||||
that process (you can learn more about them in the :doc:`tokenizer_summary <tokenizer_summary>`), which is why we need
|
that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`), which is why we need
|
||||||
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
|
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
|
||||||
pretrained.
|
pretrained.
|
||||||
|
|
||||||
@@ -295,7 +295,7 @@ precision, etc.). See the :doc:`training tutorial <training>` for more details.
|
|||||||
|
|
||||||
Once your model is fine-tuned, you can save it with its tokenizer in the following way:
|
Once your model is fine-tuned, you can save it with its tokenizer in the following way:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
tokenizer.save_pretrained(save_directory)
|
tokenizer.save_pretrained(save_directory)
|
||||||
model.save_pretrained(save_directory)
|
model.save_pretrained(save_directory)
|
||||||
@@ -305,14 +305,14 @@ directory name instead of the model name. One cool feature of 🤗 Transformers
|
|||||||
PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow. If you are
|
PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow. If you are
|
||||||
loading a saved PyTorch model in a TensorFlow model, use :func:`~transformers.TFAutoModel.from_pretrained` like this:
|
loading a saved PyTorch model in a TensorFlow model, use :func:`~transformers.TFAutoModel.from_pretrained` like this:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
tokenizer = AutoTokenizer.from_pretrained(save_directory)
|
tokenizer = AutoTokenizer.from_pretrained(save_directory)
|
||||||
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)
|
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)
|
||||||
|
|
||||||
and if you are loading a saved TensorFlow model in a PyTorch model, you should use the following code:
|
and if you are loading a saved TensorFlow model in a PyTorch model, you should use the following code:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
tokenizer = AutoTokenizer.from_pretrained(save_directory)
|
tokenizer = AutoTokenizer.from_pretrained(save_directory)
|
||||||
model = AutoModel.from_pretrained(save_directory, from_tf=True)
|
model = AutoModel.from_pretrained(save_directory, from_tf=True)
|
||||||
@@ -320,7 +320,7 @@ and if you are loading a saved TensorFlow model in a PyTorch model, you should u
|
|||||||
Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:
|
Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:
|
||||||
|
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
>>> ## PYTORCH CODE
|
>>> ## PYTORCH CODE
|
||||||
>>> pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
|
>>> pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
|
||||||
|
|||||||
@@ -477,7 +477,7 @@ This outputs a (hopefully) coherent next token following the original sequence,
|
|||||||
|
|
||||||
.. code-block::
|
.. code-block::
|
||||||
|
|
||||||
print(resulting_string)
|
>>> print(resulting_string)
|
||||||
Hugging Face is based in DUMBO, New York City, and has
|
Hugging Face is based in DUMBO, New York City, and has
|
||||||
|
|
||||||
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to generate multiple tokens up to a user-defined length.
|
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to generate multiple tokens up to a user-defined length.
|
||||||
@@ -604,8 +604,7 @@ expected results:
|
|||||||
|
|
||||||
.. code-block::
|
.. code-block::
|
||||||
|
|
||||||
print(nlp(sequence))
|
>>> print(nlp(sequence))
|
||||||
|
|
||||||
[
|
[
|
||||||
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
|
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
|
||||||
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
|
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
|
||||||
@@ -803,11 +802,6 @@ translation results nevertheless.
|
|||||||
|
|
||||||
Because the translation pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
|
Because the translation pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
|
||||||
of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
|
of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
|
||||||
This outputs the following translation into German:
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
|
|
||||||
|
|
||||||
Here is an example doing translation using a model and a tokenizer. The process is the following:
|
Here is an example doing translation using a model and a tokenizer. The process is the following:
|
||||||
|
|
||||||
|
|||||||
@@ -73,7 +73,7 @@ subwords. This also enables the model to process words it has never seen before,
|
|||||||
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
|
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
|
||||||
this:
|
this:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
>>> from transformers import BertTokenizer
|
>>> from transformers import BertTokenizer
|
||||||
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
@@ -87,7 +87,7 @@ predictions and reverse the tokenization).
|
|||||||
|
|
||||||
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
|
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
|
||||||
|
|
||||||
::
|
.. code-block::
|
||||||
|
|
||||||
>>> from transformers import XLNetTokenizer
|
>>> from transformers import XLNetTokenizer
|
||||||
>>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
|
>>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
|
||||||
|
|||||||
@@ -16,10 +16,10 @@ TF2, and focus specifically on the nuances and tools for training models in
|
|||||||
|
|
||||||
Sections:
|
Sections:
|
||||||
|
|
||||||
* :ref:`pytorch`
|
- :ref:`pytorch`
|
||||||
* :ref:`tensorflow`
|
- :ref:`tensorflow`
|
||||||
* :ref:`trainer`
|
- :ref:`trainer`
|
||||||
* :ref:`additional-resources`
|
- :ref:`additional-resources`
|
||||||
|
|
||||||
.. _pytorch:
|
.. _pytorch:
|
||||||
|
|
||||||
@@ -131,7 +131,6 @@ Then all we have to do is call ``scheduler.step()`` after ``optimizer.step()``.
|
|||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
...
|
|
||||||
loss.backward()
|
loss.backward()
|
||||||
optimizer.step()
|
optimizer.step()
|
||||||
scheduler.step()
|
scheduler.step()
|
||||||
@@ -182,6 +181,7 @@ the pretrained tokenizer name.
|
|||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
from transformers import BertTokenizer, glue_convert_examples_to_features
|
from transformers import BertTokenizer, glue_convert_examples_to_features
|
||||||
|
import tensorflow as tf
|
||||||
import tensorflow_datasets as tfds
|
import tensorflow_datasets as tfds
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
||||||
data = tfds.load('glue/mrpc')
|
data = tfds.load('glue/mrpc')
|
||||||
@@ -305,19 +305,14 @@ launching tensorboard in your specified ``logging_dir`` directory.
|
|||||||
Additional resources
|
Additional resources
|
||||||
^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
* `A lightweight colab demo
|
- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
|
||||||
<https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
|
which uses ``Trainer`` for IMDb sentiment classification.
|
||||||
which uses ``Trainer`` for IMDb sentiment classification.
|
|
||||||
|
|
||||||
* `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_
|
- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_
|
||||||
including scripts for training and fine-tuning on GLUE, SQuAD, and
|
including scripts for training and fine-tuning on GLUE, SQuAD, and several other tasks.
|
||||||
several other tasks.
|
|
||||||
|
|
||||||
* `How to train a language model
|
- `How to train a language model <https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_,
|
||||||
<https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_,
|
a detailed colab notebook which uses ``Trainer`` to train a masked language model from scratch on Esperanto.
|
||||||
a detailed colab notebook which uses ``Trainer`` to train a masked
|
|
||||||
language model from scratch on Esperanto.
|
|
||||||
|
|
||||||
* `🤗 Transformers Notebooks <./notebooks.html>`_ which contain dozens
|
- `🤗 Transformers Notebooks <notebooks.html>`_ which contain dozens of example notebooks from the community for
|
||||||
of example notebooks from the community for training and using
|
training and using 🤗 Transformers on a variety of tasks.
|
||||||
🤗 Transformers on a variety of tasks.
|
|
||||||
|
|||||||