Minor documentation revisions from copyediting (#9266)
* typo: Revise "checkout" to "check out"
* typo: Change "seemlessly" to "seamlessly"
* typo: Close parentheses in "Using the tokenizer"
* typo: Add closing parenthesis to supported models aside
* docs: Treat ``position_ids`` as plural
Alternatively, the word "argument" could be added to make the subject singular.
* docs: Remove comma, making subordinate clause
* docs: Remove comma separating verb and direct object
* docs: Fix typo ("next" -> "text")
* docs: Reverse phrase order to simplify sentence
* docs: "quicktour" -> "quick tour"
* docs: "to throw" -> "from throwing"
* docs: Remove disruptive newline in padding/truncation section
* docs: "show exemplary" -> "show examples of"
* docs: "much harder as" -> "much harder than"
* docs: Fix typo "seach" -> "search"
* docs: Fix subject-verb disagreement in WordPiece description
* docs: Fix style in preprocessing.rst
@@ -226,7 +226,7 @@ Contrary to RNNs that have the position of each token embedded within them, tran
 each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
 the list of tokens.
 
-They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
+They are an optional parameter. If no ``position_ids`` are passed to the model, the IDs are automatically created as
 absolute positional embeddings.
 
 Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
@@ -16,7 +16,7 @@ Summary of the models
 This is a summary of the models available in 🤗 Transformers. It assumes you’re familiar with the original `transformer
 model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
 <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
-models. You can check them more in detail in their respective documentation. Also checkout the :doc:`pretrained model
+models. You can check them more in detail in their respective documentation. Also check out the :doc:`pretrained model
 page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
 <https://huggingface.co/models>`_.
 
@@ -30,7 +30,7 @@ Each one of the models in the library falls into one of the following categories
 
 Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
 previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
-sentence so that the attention heads can only see what was before in the next, and not what’s after. Although those
+sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
 models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
 typical example of such models is GPT.
 
@@ -512,8 +512,8 @@ BART
 <https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
 
 Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
-fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder
-, on the pretraining tasks, a composition of the following transformations are applied:
+fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). A composition of
+the following transformations are applied on the pretraining tasks for the encoder:
 
 * mask random tokens (like in BERT)
 * delete random tokens
@@ -78,7 +78,7 @@ The library is built around three types of classes for each model:
 All these classes can be instantiated from pretrained instances and saved locally using two methods:
 
 - :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
-provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>` or
+provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`) or
 stored locally (or on a server) by the user,
 - :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
 :obj:`from_pretrained()`.
@@ -17,10 +17,10 @@ In this tutorial, we'll explore how to preprocess your data using 🤗 Transform
 call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
 you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
 
-As we saw in the :doc:`quicktour </quicktour>`, the tokenizer will first split a given text in words (or part of words,
-punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able to
-build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect to
-work properly.
+As we saw in the :doc:`quick tour </quicktour>`, the tokenizer will first split a given text in words (or part of
+words, punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able
+to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
+to work properly.
 
 .. note::
 
@@ -131,7 +131,7 @@ ones it should not (because they represent padding in this case).
 
 
 Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
-can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer to throw those kinds of warnings.
+can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer from throwing those kinds of warnings.
 
 .. _sentence-pairs:
 
@@ -216,7 +216,6 @@ Everything you always wanted to know about padding and truncation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
-
 truncate to the maximum length the mode can accept). However, the API supports more strategies if you need them. The
 three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
 
@@ -158,7 +158,7 @@ Using the tokenizer
 
 We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
 words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
-that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`, which is why we need
+that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`), which is why we need
 to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
 pretrained.
 
@@ -327,7 +327,7 @@ Masked Language Modeling
 Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
 fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
 right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
-downstream tasks, requiring bi-directional context such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
+downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
 <https://arxiv.org/abs/1910.13461>`__, part 4.2).
 
 Here is an example of using pipelines to replace a mask from a sequence:
@@ -657,7 +657,7 @@ Here are the expected results:
 {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
 ]
 
-Note, how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
+Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
 "DUMBO" and "Manhattan Bridge" have been identified as locations.
 
 Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:
@@ -18,7 +18,7 @@ On this page, we will have a closer look at tokenization. As we saw in :doc:`the
 look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
 text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
 tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
-and :ref:`SentencePiece <sentencepiece>`, and show exemplary which tokenizer type is used by which model.
+and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.
 
 Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
 type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
@@ -72,7 +72,7 @@ greater than 50,000, especially if they are pretrained only on a single language
 So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
 character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
 the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
-for the letter ``"t"`` is much harder as learning a context-independent representation for the word ``"today"``.
+for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
 Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
 transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
 
@@ -202,10 +202,10 @@ WordPiece
 
 WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
 <model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
-Voice Seach (Schuster et al., 2012)
+Voice Search (Schuster et al., 2012)
 <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
 BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
-progressively learn a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
+progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
 symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
 
 So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
@@ -14,7 +14,7 @@ Training and fine-tuning
 =======================================================================================================================
 
 Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
-seemlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
+seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
 standard training tools available in either framework. We will also show how to use our included
 :func:`~transformers.Trainer` class which handles much of the complexity of training for you.
 