[Tokenizer Doc] Improve tokenizer summary (#8622)

* improve summary * small fixes * cleaned line length * correct "" formatting * apply sylvains suggestions
2020-11-18 17:14:15 +01:00
parent 2f9d49b389
commit cdfa56afe0
1 changed files with 159 additions and 136 deletions
--- a/docs/source/tokenizer_summary.rst
+++ b/docs/source/tokenizer_summary.rst
@@ -1,223 +1,243 @@
-Tokenizer summary
+Summary of the tokenizers
 -----------------------------------------------------------------------------------------------------------------------
-In this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
+On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
-<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids. The second
+<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids through a
-part is pretty straightforward, here we will focus on the first part. More specifically, we will look at the three main
+look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
-different kinds of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`,
+text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
-:ref:`WordPiece <wordpiece>` and :ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of
+tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
-those.
+and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.
-Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
+Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
-algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
+type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
-using :ref:`WordPiece <wordpiece>`.
+that the model uses :ref:`WordPiece <wordpiece>`.
-Introduction to tokenization
+Introduction
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
+Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
-instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing this
+For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."`` A simple way of tokenizing
-text is just to split it by spaces, which would give:
+this text is to split it by spaces, which would give:
 .. code-block::
    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
-This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
+This is a sensible first step, but if we look at the tokens ``"Transformers?"`` and ``"do."``, we notice that the
-will be different than the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
+punctuation is attached to the words ``"Transformers"`` and ``"do"``, which is suboptimal. We should take the
-into account. This would give:
+punctuation into account so that a model does not have to learn a different representation of a word and every possible
 punctuation symbol that could follow it, which would explode the number of representations the model has to learn.
 Taking punctuation into account, tokenizing our example text would give:
 .. code-block::
    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
-which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for do not, so
+Better. However, the tokenization handled the word ``"Don't"`` poorly. ``"Don't"`` stands for
-it should probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
+``"do not"``, so it would be better tokenized as ``["Do", "n't"]``. This is where things start getting complicated, and
-part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
+part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a
-into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
+different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an
-perform properly if you don't use the exact same rules as the persons who pretrained it.
+input that was tokenized with the same rules that were used to tokenize its training data.
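The two simple schemes discussed above can be sketched in plain Python. This is only an illustration of the idea, not how any tokenizer in the library is implemented; the regular expression is an assumption chosen to reproduce the example output.

```python
import re

text = "Don't you love 🤗 Transformers? We sure do."

# Space tokenization: split on runs of whitespace.
space_tokens = text.split()

# Space and punctuation tokenization: keep runs of word characters together
# and emit every other non-whitespace character as its own token.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(space_tokens)
print(punct_tokens)
```

Neither scheme knows that ``"Don't"`` is better split as ``["Do", "n't"]``; that requires language-specific rules.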
 `spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
-rule-based tokenizers. On the text above, they'd output something like:
+rule-based tokenizers. Applying them to our example, *spaCy* and *Moses* would output something like:
 .. code-block::
    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
-Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
+As can be seen, space and punctuation tokenization as well as rule-based tokenization are used here. Space and
-sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
+punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined
-you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used). :doc:`Transformer
+as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this
-XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary size of 267,735!
+tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
 usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, :doc:`Transformer XL
 <model_doc/transformerxl>` uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
-A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
+Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which
-TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
+increases both memory and time complexity. In general, transformers models rarely have a vocabulary size
-transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
+greater than 50,000, especially if they are pretrained only on a single language.
 language.
-So if tokenizing on words is unsatisfactory, we could go on the opposite direction and simply tokenize on characters.
+So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
-While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
+character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for
-as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
+the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
-all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
+for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
 Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
 transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
 Subword tokenization
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Subword tokenization algorithms rely on the principle that most common words should be left as is, but rare words
+Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
-should be decomposed in meaningful subword units. For instance "annoyingly" might be considered a rare word and
+subwords, but rare words should be decomposed into meaningful subwords. For instance ``"annoyingly"`` might be
-decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can
+considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
-form (almost) arbitrarily long complex words by stringing together some subwords.
+stand-alone subwords would appear more frequently while at the same time the meaning of ``"annoyingly"`` is kept by the
 composite meaning of ``"annoying"`` and ``"ly"``. This is especially useful in agglutinative languages such as Turkish,
 where you can form (almost) arbitrarily long complex words by stringing together subwords.
-This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
+Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful
-subwords. This also enables the model to process words it has never seen before, by decomposing them into subwords it
+context-independent representations. In addition, subword tokenization enables the model to process words it has never
-knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like this:
+seen before, by decomposing them into known subwords. For instance, the :class:`~transformers.BertTokenizer` tokenizes
 ``"I have a new GPU!"`` as follows:
 .. code-block::
    >>> from transformers import BertTokenizer
-    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> tokenizer.tokenize("I have a new GPU!")
-    ['i', 'have', 'a', 'new', 'gp', '##u', '!']
+    ["i", "have", "a", "new", "gp", "##u", "!"]
-Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
+Because we are considering the uncased model, the sentence was lowercased first. We can see that the words ``["i",
-vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it in subwords it knows: "gp" and "##u". The
+"have", "a", "new"]`` are present in the tokenizer's vocabulary, but the word ``"gpu"`` is not. Consequently, the
-"##" means that the rest of the token should be attached to the previous one, without space (for when we need to decode
+tokenizer splits ``"gpu"`` into known subwords: ``["gp", "##u"]``. ``"##"`` means that the rest of the token should
-predictions and reverse the tokenization).
+be attached to the previous one, without space (for decoding or reversal of the tokenization).
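As a naive sketch of what reversing this tokenization can look like, the tokens can be joined with spaces and every ``" ##"`` removed. This is an illustration under simple assumptions (it does not re-attach punctuation or restore casing), not a description of the library's decoding code.

```python
tokens = ["i", "have", "a", "new", "gp", "##u", "!"]

# Join with spaces, then glue every "##" continuation piece to the token before it.
decoded = " ".join(tokens).replace(" ##", "")

print(decoded)
```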
-Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
+As another example, :class:`~transformers.XLNetTokenizer` tokenizes our earlier example text as follows:
 .. code-block::
    >>> from transformers import XLNetTokenizer
-    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
+    >>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
-    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
+    ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]
-We'll get back to the meaning of those '▁' when we look at :ref:`SentencePiece <sentencepiece>` but you can see
+We'll get back to the meaning of those ``"▁"`` when we look at :ref:`SentencePiece <sentencepiece>`. As one can see,
-Transformers has been split into "Transform" and "ers".
+the rare word ``"Transformers"`` has been split into the more frequent subwords ``"Transform"`` and ``"ers"``.
-Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
+Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization
-training which is usually done on the corpus the corresponding model will be trained on.
+algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained
 on.
 .. _byte-pair-encoding:
-Byte-Pair Encoding
+Byte-Pair Encoding (BPE)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
+Byte-Pair Encoding (BPE) was introduced in `Neural Machine Translation of Rare Words with Subword Units (Sennrich et
-splitting the training data into words, which can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
+al., 2015) <https://arxiv.org/abs/1508.07909>`__. BPE relies on a pre-tokenizer that splits the training data into
-:doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` use
+words. Pre-tokenization can be as simple as space tokenization, e.g. :doc:`GPT-2 <model_doc/gpt2>`, :doc:`Roberta
-Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
+<model_doc/roberta>`. More advanced pre-tokenization includes rule-based tokenization, e.g. :doc:`XLM <model_doc/xlm>`,
 :doc:`FlauBERT <model_doc/flaubert>` which uses Moses for most languages, or :doc:`GPT <model_doc/gpt>` which uses
 spaCy and ftfy, to count the frequency of each word in the training corpus.
-:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
+After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
 training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
 of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until
 the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to
 define before training the tokenizer.
-It then begins from the list of all characters and will learn merge rules to form a new token from two symbols in the
+As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been
-vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
+determined:
 Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
 word):
 .. code-block::
-    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)
+    ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
-Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our words are first split by character:
+Consequently, the base vocabulary is ``["b", "g", "h", "n", "p", "s", "u"]``. Splitting all words into symbols of the
 base vocabulary, we obtain:
 .. code-block::
-    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
+    ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
-We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
+BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In
-times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
+the example above ``"h"`` followed by ``"u"`` is present `10 + 5 = 15` times (10 times in the 10 occurrences of
-`10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
+``"hug"``, 5 times in the 5 occurrences of ``"hugs"``). However, the most frequent symbol pair is ``"u"`` followed by
-then it adds 'ug' to the vocabulary. Our corpus then becomes
+``"g"``, occurring `10 + 5 + 5 = 20` times in total. Thus, the first merge rule the tokenizer learns is to group all
 ``"u"`` symbols followed by a ``"g"`` symbol together. Next, ``"ug"`` is added to the vocabulary. The set of words
 then becomes
 .. code-block::
-    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)
+    ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
-and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those two
+BPE then identifies the next most common symbol pair. It's ``"u"`` followed by ``"n"``, which occurs 16 times. ``"u"``,
-and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add 'hug'
+``"n"`` is merged to ``"un"`` and added to the vocabulary. The next most frequent symbol pair is ``"h"`` followed by
-to the vocabulary.
+``"ug"``, occurring 15 times. Again the pair is merged and ``"hug"`` can be added to the vocabulary.
-At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
+At this stage, the vocabulary is ``["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]`` and our set of unique words
-represented as
+is represented as
 .. code-block::
-    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
+    ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
-If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters
+Assuming that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied
-that were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be
+to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance,
-tokenized as ``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general
+the word ``"bug"`` would be tokenized to ``["b", "ug"]`` but ``"mug"`` would be tokenized as ``["<unk>", "ug"]`` since
-(since the base corpus uses all of them), but to special characters like emojis.
+the symbol ``"m"`` is not in the base vocabulary. In general, single letters such as ``"m"`` are not replaced by the
 ``"<unk>"`` symbol because the training data usually includes at least one occurrence of each letter, but it is likely
 to happen for very special characters like emojis.
-As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
+As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter
 to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
-and chose to stop the training of the tokenizer at 40,000 merges.
+and chose to stop training after 40,000 merges.
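The BPE training loop walked through above can be sketched in a few lines of Python. The function names ``merge_pair``, ``train_bpe`` and ``tokenize`` are hypothetical helpers written for this illustration; real implementations are considerably more optimized and handle ties and special tokens differently.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} mapping."""
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})  # base vocabulary
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by the frequency of each word.
        pair_counts = Counter()
        for word, symbols in splits.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += word_freqs[word]
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return vocab, merges


def tokenize(word, vocab, merges):
    """Split a new word into characters, then apply the merge rules in learned order."""
    symbols = [ch if ch in vocab else "<unk>" for ch in word]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges = train_bpe(word_freqs, num_merges=3)
print(vocab)
print(tokenize("bug", vocab, merges), tokenize("mug", vocab, merges))
```

With three merges, the sketch reproduces the vocabulary and the ``"bug"`` and ``"mug"`` tokenizations from the example above.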
 Byte-level BPE
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
+A base vocabulary that includes all possible base characters can be quite large if *e.g.* all unicode characters are
-all unicode characters, the `GPT-2 paper
+considered as base characters. To have a better base vocabulary, `GPT-2
-<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ introduces a
+<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ uses bytes
-clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some additional rules to
+as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that
-deal with punctuation, this manages to be able to tokenize every text without needing an unknown token. For instance,
+every base character is included in the vocabulary. With some additional rules to deal with punctuation, GPT-2's
-the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens,
+tokenizer can tokenize every text without the need for the ``"<unk>"`` symbol. :doc:`GPT-2 <model_doc/gpt2>` has a
-a special end-of-text token and the symbols learned with 50,000 merges.
+vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols
 learned with 50,000 merges.
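To see why a byte-level base vocabulary never needs an unknown symbol, it helps to look at the UTF-8 bytes of a string. The snippet below only illustrates the base alphabet; it does not reproduce GPT-2's actual tokenizer, which additionally applies its learned merges on top of the bytes.

```python
text = "🤗 mug"

# Every string can be encoded as a sequence of byte values in the range 0..255.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)

# The mapping is lossless: decoding the bytes restores the original text.
restored = bytes(byte_ids).decode("utf-8")
print(restored)
```

The emoji, which a character-level base vocabulary might not contain, becomes four ordinary byte values.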
 .. _wordpiece:
 WordPiece
 =======================================================================================================================
-WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as :doc:`DistilBERT
+WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
-<model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in `this paper
+<model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
-<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies on the same
+Voice Search (Schuster et al., 2012)
-base as BPE, which is to initialize the vocabulary to every character present in the corpus and progressively learn a
+<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
-given number of merge rules, the difference is that it doesn't choose the pair that is the most frequent but the one
+BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
-that will maximize the likelihood on the corpus once merged.
+progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
 symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
-What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
+So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
-having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
+equivalent to finding the symbol pair whose probability divided by the probabilities of its first symbol followed by
-subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
+its second symbol is the greatest among all symbol pairs. *E.g.* ``"u"``, followed by ``"g"`` would have only been
-sure it's `worth it`.
+merged if the probability of ``"ug"`` divided by ``"u"``, ``"g"`` had been greater than for any other symbol
 pair. Intuitively, WordPiece is slightly different from BPE in that it evaluates what it `loses` by merging two symbols
 to make sure it's `worth it`.
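The difference to BPE can be made concrete on the toy word set from the BPE section. The sketch below assumes the score ``count(ab) / (count(a) * count(b))``, which ranks pairs the same way as the probability ratio described above because the normalization constants are identical for every pair. It is a simplified illustration, not the exact procedure used to train any released WordPiece vocabulary.

```python
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Frequency-weighted counts of single symbols and of adjacent symbol pairs.
symbol_counts, pair_counts = Counter(), Counter()
for word, freq in word_freqs.items():
    for ch in word:
        symbol_counts[ch] += freq
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += freq


def score(pair):
    """Pair count divided by the product of the counts of its two parts."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])


most_frequent = max(pair_counts, key=pair_counts.get)  # what BPE would merge first
best_scoring = max(pair_counts, key=score)  # what the ratio criterion would merge first
print(most_frequent, best_scoring)
```

On this data BPE merges ``"u"`` and ``"g"`` first, whereas the ratio criterion prefers ``"g"`` followed by ``"s"``: the pair is rarer, but ``"s"`` never occurs without a preceding ``"g"``, so little is lost by merging.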
 .. _unigram:
 Unigram
 =======================================================================================================================
-Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
+Unigram is a subword tokenization algorithm introduced in `Subword Regularization: Improving Neural Network Translation
-Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
+Models with Multiple Subword Candidates (Kudo, 2018) <https://arxiv.org/pdf/1804.10959.pdf>`__. In contrast to BPE or
-from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
+WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively removes
-progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
+symbols to obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and
-with :ref:`SentencePiece <sentencepiece>`.
+the most common substrings. Unigram is not used directly for any of the models in 🤗 Transformers, but it's used in
 conjunction with :ref:`SentencePiece <sentencepiece>`.
-More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
+At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training
-for each subword, evaluate how much the loss would increase if the subword was removed from the vocabulary. It then
+data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm
-sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and
+computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then
-removes all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary
+removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is the lowest, *i.e.* those
-has reached the desired size, always keeping the base characters (to be able to tokenize any word written with them,
+symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has
-like BPE or WordPiece).
+reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.
-Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
+Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of
-tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
+tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:
 vocabulary
 .. code-block::
-    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
+    ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],
-we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So which
+``"hugs"`` could be tokenized as ``["hug", "s"]``, ``["h", "ug", "s"]`` or ``["h", "u", "g", "s"]``. So which one
-one choose? On top of saving the vocabulary, the trained tokenizer will save the probability of each token in the
+to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary so that
-training corpus. You can then give a probability to each tokenization (which is the product of the probabilities of the
+the probability of each possible tokenization can be computed after training. The algorithm simply picks the most
-tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
+likely tokenization in practice, but also offers the possibility to sample a possible tokenization according to their
-of the tokenization according to their probabilities).
+probabilities.
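Picking the most likely tokenization can be sketched by enumerating all segmentations of a word and multiplying the token probabilities. The probabilities below are hypothetical values chosen for illustration, and ``segmentations`` is a helper written for this sketch. Practical implementations avoid brute-force enumeration and use dynamic programming instead.

```python
import math

# Hypothetical token probabilities (illustrative only); they sum to 1.
probs = {"b": 0.04, "g": 0.08, "h": 0.08, "n": 0.10, "p": 0.10,
         "s": 0.05, "u": 0.15, "ug": 0.15, "un": 0.15, "hug": 0.10}


def segmentations(word, vocab):
    """Return every way of splitting `word` into tokens contained in `vocab`."""
    if not word:
        return [[]]
    result = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in vocab:
            result += [[prefix] + rest for rest in segmentations(word[end:], vocab)]
    return result


candidates = segmentations("hugs", probs)
# The probability of a tokenization is the product of its token probabilities.
best = max(candidates, key=lambda tokens: math.prod(probs[t] for t in tokens))
print(candidates)
print(best)
```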
-Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words :math:`x_{1}, \dots,
+Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of
-x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible tokenizations of
+the words :math:`x_{1}, \dots, x_{N}` and that the set of all possible tokenizations for a word :math:`x_{i}` is
-:math:`x_{i}` (with the current vocabulary), then the loss is defined as
+defined as :math:`S(x_{i})`, then the overall loss is defined as
 .. math::
    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
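The inner sum over all tokenizations does not require explicit enumeration. A minimal sketch, assuming hypothetical token probabilities and the helper names ``marginal`` and ``unigram_loss``, accumulates the total probability of every prefix of the word from left to right:

```python
import math

# Hypothetical token probabilities (illustrative only).
probs = {"b": 0.04, "g": 0.08, "h": 0.08, "n": 0.10, "p": 0.10,
         "s": 0.05, "u": 0.15, "ug": 0.15, "un": 0.15, "hug": 0.10}


def marginal(word, probs):
    """Sum of p(x) over all tokenizations x of `word`."""
    # alpha[i] holds the total probability of all tokenizations of word[:i].
    alpha = [1.0] + [0.0] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                alpha[end] += alpha[start] * probs[piece]
    return alpha[-1]


def unigram_loss(words, probs):
    """Negative log-likelihood of the words; each word must be tokenizable."""
    return -sum(math.log(marginal(word, probs)) for word in words)


print(marginal("hugs", probs))
print(unigram_loss(["hug", "hugs"], probs))
```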
@@ -227,15 +247,18 @@ x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all
 SentencePiece
 =======================================================================================================================
-All the methods we have been looking at so far required some form of pretokenization, which has a central problem: not
+All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to
-all languages use spaces to separate words. This is a problem :doc:`XLM <model_doc/xlm>` solves by using specific
+separate words. However, not all languages use spaces to separate words. One possible solution is to use language
-pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve this problem,
+specific pre-tokenizers, *e.g.* :doc:`XLM <model_doc/xlm>` uses a specific Chinese, Japanese, and Thai pre-tokenizer.
-SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
+To solve this problem more generally, `SentencePiece: A simple and language independent subword tokenizer and
-includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
+detokenizer for Neural Text Processing (Kudo et al., 2018) <https://arxiv.org/pdf/1808.06226.pdf>`__ treats the input
 as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram
 algorithm to construct the appropriate vocabulary.
-That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
+The :class:`~transformers.XLNetTokenizer` uses SentencePiece for example, which is also why in the example earlier the
-the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate all
+``"▁"`` character was included in the vocabulary. Decoding with SentencePiece is very easy since all tokens can just be
-of them together and replace '▁' with space.
+concatenated and ``"▁"`` is replaced by a space.
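The decoding step described here can be sketched directly on the XLNet tokens from the earlier example. Stripping the leading space is an assumption of this sketch, made so that the result matches the original sentence.

```python
tokens = ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?",
          "▁We", "▁sure", "▁do", "."]

# Concatenate all tokens, then turn the "▁" marker back into a space.
decoded = "".join(tokens).replace("▁", " ").lstrip()
print(decoded)
```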
-All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
+All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
-:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
+using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>`, :doc:`Marian
 <model_doc/marian>`, and :doc:`T5 <model_doc/t5>`.