Minor documentation revisions from copyediting (#9266)
* typo: Revise "checkout" to "check out"
* typo: Change "seemlessly" to "seamlessly"
* typo: Close parentheses in "Using the tokenizer"
* typo: Add closing parenthesis to supported models aside
* docs: Treat ``position_ids`` as plural
  Alternatively, the word "argument" could be added to make the subject singular.
* docs: Remove comma, making subordinate clause
* docs: Remove comma separating verb and direct object
* docs: Fix typo ("next" -> "text")
* docs: Reverse phrase order to simplify sentence
* docs: "quicktour" -> "quick tour"
* docs: "to throw" -> "from throwing"
* docs: Remove disruptive newline in padding/truncation section
* docs: "show exemplary" -> "show examples of"
* docs: "much harder as" -> "much harder than"
* docs: Fix typo "seach" -> "search"
* docs: Fix subject-verb disagreement in WordPiece description
* docs: Fix style in preprocessing.rst
@@ -18,7 +18,7 @@ On this page, we will have a closer look at tokenization. As we saw in :doc:`the
 look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
 text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
 tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
-and :ref:`SentencePiece <sentencepiece>`, and show exemplary which tokenizer type is used by which model.
+and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.
 
 Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
 type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
@@ -72,7 +72,7 @@ greater than 50,000, especially if they are pretrained only on a single language
 So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
 character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
 the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
-for the letter ``"t"`` is much harder as learning a context-independent representation for the word ``"today"``.
+for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
 Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
 transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
 
@@ -202,10 +202,10 @@ WordPiece
 
 WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
 <model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
-Voice Seach (Schuster et al., 2012)
+Voice Search (Schuster et al., 2012)
 <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
 BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
-progressively learn a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
+progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
 symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
 
 So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is