Minor documentation revisions from copyediting (#9266)
* typo: Revise "checkout" to "check out"
* typo: Change "seemlessly" to "seamlessly"
* typo: Close parentheses in "Using the tokenizer"
* typo: Add closing parenthesis to supported models aside
* docs: Treat ``position_ids`` as plural
Alternatively, the word "argument" could be added to make the subject singular.
* docs: Remove comma, making subordinate clause
* docs: Remove comma separating verb and direct object
* docs: Fix typo ("next" -> "text")
* docs: Reverse phrase order to simplify sentence
* docs: "quicktour" -> "quick tour"
* docs: "to throw" -> "from throwing"
* docs: Remove disruptive newline in padding/truncation section
* docs: "show exemplary" -> "show examples of"
* docs: "much harder as" -> "much harder than"
* docs: Fix typo "seach" -> "search"
* docs: Fix subject-verb disagreement in WordPiece description
* docs: Fix style in preprocessing.rst
This commit is contained in:
@@ -226,7 +226,7 @@ Contrary to RNNs that have the position of each token embedded within them, tran
|
||||
each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
|
||||
the list of tokens.
|
||||
|
||||
They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
|
||||
They are an optional parameter. If no ``position_ids`` are passed to the model, the IDs are automatically created as
|
||||
absolute positional embeddings.
|
||||
|
||||
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
|
||||
|
||||
@@ -30,7 +30,7 @@ Each one of the models in the library falls into one of the following categories
|
||||
|
||||
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
|
||||
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
|
||||
sentence so that the attention heads can only see what was before in the next, and not what’s after. Although those
|
||||
sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
|
||||
models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
|
||||
typical example of such models is GPT.
|
||||
|
||||
@@ -512,8 +512,8 @@ BART
|
||||
<https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
|
||||
|
||||
Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
|
||||
fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder
|
||||
, on the pretraining tasks, a composition of the following transformations are applied:
|
||||
fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). A composition of
|
||||
the following transformations are applied on the pretraining tasks for the encoder:
|
||||
|
||||
* mask random tokens (like in BERT)
|
||||
* delete random tokens
|
||||
|
||||
@@ -78,7 +78,7 @@ The library is built around three types of classes for each model:
|
||||
All these classes can be instantiated from pretrained instances and saved locally using two methods:
|
||||
|
||||
- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
|
||||
provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>` or
|
||||
provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`) or
|
||||
stored locally (or on a server) by the user,
|
||||
- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
|
||||
:obj:`from_pretrained()`.
|
||||
|
||||
@@ -17,10 +17,10 @@ In this tutorial, we'll explore how to preprocess your data using 🤗 Transform
|
||||
call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
|
||||
you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
|
||||
|
||||
As we saw in the :doc:`quicktour </quicktour>`, the tokenizer will first split a given text in words (or part of words,
|
||||
punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able to
|
||||
build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect to
|
||||
work properly.
|
||||
As we saw in the :doc:`quick tour </quicktour>`, the tokenizer will first split a given text in words (or part of
|
||||
words, punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able
|
||||
to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
|
||||
to work properly.
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -131,7 +131,7 @@ ones it should not (because they represent padding in this case).
|
||||
|
||||
|
||||
Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
|
||||
can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer to throw those kinds of warnings.
|
||||
can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer from throwing those kinds of warnings.
|
||||
|
||||
.. _sentence-pairs:
|
||||
|
||||
@@ -216,7 +216,6 @@ Everything you always wanted to know about padding and truncation
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
|
||||
|
||||
truncate to the maximum length the mode can accept). However, the API supports more strategies if you need them. The
|
||||
three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
|
||||
|
||||
|
||||
@@ -158,7 +158,7 @@ Using the tokenizer
|
||||
|
||||
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
|
||||
words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
|
||||
that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`, which is why we need
|
||||
that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`), which is why we need
|
||||
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
|
||||
pretrained.
|
||||
|
||||
|
||||
@@ -327,7 +327,7 @@ Masked Language Modeling
|
||||
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
|
||||
fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
|
||||
right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
|
||||
downstream tasks, requiring bi-directional context such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
|
||||
downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
|
||||
<https://arxiv.org/abs/1910.13461>`__, part 4.2).
|
||||
|
||||
Here is an example of using pipelines to replace a mask from a sequence:
|
||||
@@ -657,7 +657,7 @@ Here are the expected results:
|
||||
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
|
||||
]
|
||||
|
||||
Note, how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
|
||||
Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
|
||||
"DUMBO" and "Manhattan Bridge" have been identified as locations.
|
||||
|
||||
Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:
|
||||
|
||||
@@ -18,7 +18,7 @@ On this page, we will have a closer look at tokenization. As we saw in :doc:`the
|
||||
look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
|
||||
text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
|
||||
tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
|
||||
and :ref:`SentencePiece <sentencepiece>`, and show exemplary which tokenizer type is used by which model.
|
||||
and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.
|
||||
|
||||
Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
|
||||
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
|
||||
@@ -72,7 +72,7 @@ greater than 50,000, especially if they are pretrained only on a single language
|
||||
So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
|
||||
character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
|
||||
the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
|
||||
for the letter ``"t"`` is much harder as learning a context-independent representation for the word ``"today"``.
|
||||
for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
|
||||
Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
|
||||
transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
|
||||
|
||||
@@ -202,10 +202,10 @@ WordPiece
|
||||
|
||||
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
|
||||
<model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
|
||||
Voice Seach (Schuster et al., 2012)
|
||||
Voice Search (Schuster et al., 2012)
|
||||
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
|
||||
BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
|
||||
progressively learn a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
|
||||
progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
|
||||
symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
|
||||
|
||||
So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
|
||||
|
||||
@@ -14,7 +14,7 @@ Training and fine-tuning
|
||||
=======================================================================================================================
|
||||
|
||||
Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
|
||||
seemlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
|
||||
seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
|
||||
standard training tools available in either framework. We will also show how to use our included
|
||||
:func:`~transformers.Trainer` class which handles much of the complexity of training for you.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user