Doc styling (#8067)
* Important files * Styling them all * Revert "Styling them all" This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e. * Syling them for realsies * Fix syntax error * Fix benchmark_utils * More fixes * Fix modeling auto and script * Remove new line * Fixes * More fixes * Fix more files * Style * Add FSMT * More fixes * More fixes * More fixes * More fixes * Fixes * More fixes * More fixes * Last fixes * Make sphinx happy
This commit is contained in:
@@ -1,12 +1,12 @@
|
||||
Tokenizer summary
|
||||
-----------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
In this page, we will have a closer look at tokenization. As we saw in
|
||||
:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which then
|
||||
are converted to ids. The second part is pretty straightforward, here we will focus on the first part. More
|
||||
specifically, we will look at the three main different kinds of tokenizers used in 🤗 Transformers:
|
||||
:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
|
||||
:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of those.
|
||||
In this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
|
||||
<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids. The second
|
||||
part is pretty straightforward, here we will focus on the first part. More specifically, we will look at the three main
|
||||
different kinds of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`,
|
||||
:ref:`WordPiece <wordpiece>` and :ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of
|
||||
those.
|
||||
|
||||
Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
|
||||
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
|
||||
@@ -16,8 +16,8 @@ Introduction to tokenization
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
|
||||
instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing
|
||||
this text is just to split it by spaces, which would give:
|
||||
instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing this
|
||||
text is just to split it by spaces, which would give:
|
||||
|
||||
.. code-block::
|
||||
|
||||
@@ -46,9 +46,8 @@ rule-based tokenizers. On the text above, they'd output something like:
|
||||
|
||||
Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
|
||||
sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
|
||||
you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
|
||||
:doc:`Transformer XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary
|
||||
size of 267,735!
|
||||
you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used). :doc:`Transformer
|
||||
XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary size of 267,735!
|
||||
|
||||
A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
|
||||
TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
|
||||
@@ -69,9 +68,8 @@ decomposed as "annoying" and "ly". This is especially useful in agglutinative la
|
||||
form (almost) arbitrarily long complex words by stringing together some subwords.
|
||||
|
||||
This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
|
||||
subwords. This also enables the model to process words it has never seen before, by decomposing them into
|
||||
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
|
||||
this:
|
||||
subwords. This also enables the model to process words it has never seen before, by decomposing them into subwords it
|
||||
knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like this:
|
||||
|
||||
.. code-block::
|
||||
|
||||
@@ -81,8 +79,8 @@ this:
|
||||
['i', 'have', 'a', 'new', 'gp', '##u', '!']
|
||||
|
||||
Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
|
||||
vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it in subwords it knows: "gp" and "##u". The "##"
|
||||
means that the rest of the token should be attached to the previous one, without space (for when we need to decode
|
||||
vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it in subwords it knows: "gp" and "##u". The
|
||||
"##" means that the rest of the token should be attached to the previous one, without space (for when we need to decode
|
||||
predictions and reverse the tokenization).
|
||||
|
||||
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
|
||||
@@ -106,9 +104,9 @@ Byte-Pair Encoding
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
|
||||
splitting the training data into words, which can be a simple space tokenization
|
||||
(:doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer
|
||||
(:doc:`XLM <model_doc/xlm>` use Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
|
||||
splitting the training data into words, which can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
|
||||
:doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` use
|
||||
Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
|
||||
|
||||
:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
|
||||
|
||||
@@ -148,10 +146,10 @@ represented as
|
||||
|
||||
('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
|
||||
|
||||
If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters that
|
||||
were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be tokenized as
|
||||
``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general (since the
|
||||
base corpus uses all of them), but to special characters like emojis.
|
||||
If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters
|
||||
that were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be
|
||||
tokenized as ``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general
|
||||
(since the base corpus uses all of them), but to special characters like emojis.
|
||||
|
||||
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
|
||||
to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
|
||||
@@ -161,24 +159,24 @@ Byte-level BPE
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
|
||||
all unicode characters, the
|
||||
`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
|
||||
introduces a clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some
|
||||
additional rules to deal with punctuation, this manages to be able to tokenize every text without needing an unknown
|
||||
token. For instance, the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the
|
||||
256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
|
||||
all unicode characters, the `GPT-2 paper
|
||||
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ introduces a
|
||||
clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some additional rules to
|
||||
deal with punctuation, this manages to be able to tokenize every text without needing an unknown token. For instance,
|
||||
the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens,
|
||||
a special end-of-text token and the symbols learned with 50,000 merges.
|
||||
|
||||
.. _wordpiece:
|
||||
|
||||
WordPiece
|
||||
=======================================================================================================================
|
||||
|
||||
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
|
||||
:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
|
||||
`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
|
||||
on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
|
||||
progressively learn a given number of merge rules, the difference is that it doesn't choose the pair that is the most
|
||||
frequent but the one that will maximize the likelihood on the corpus once merged.
|
||||
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as :doc:`DistilBERT
|
||||
<model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in `this paper
|
||||
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies on the same
|
||||
base as BPE, which is to initialize the vocabulary to every character present in the corpus and progressively learn a
|
||||
given number of merge rules, the difference is that it doesn't choose the pair that is the most frequent but the one
|
||||
that will maximize the likelihood on the corpus once merged.
|
||||
|
||||
What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
|
||||
having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
|
||||
@@ -198,10 +196,10 @@ with :ref:`SentencePiece <sentencepiece>`.
|
||||
|
||||
More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
|
||||
for each subword, evaluate how much the loss would increase if the subword was removed from the vocabulary. It then
|
||||
sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and removes
|
||||
all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary has
|
||||
reached the desired size, always keeping the base characters (to be able to tokenize any word written with them, like
|
||||
BPE or WordPiece).
|
||||
sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and
|
||||
removes all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary
|
||||
has reached the desired size, always keeping the base characters (to be able to tokenize any word written with them,
|
||||
like BPE or WordPiece).
|
||||
|
||||
Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
|
||||
tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
|
||||
@@ -217,9 +215,9 @@ training corpus. You can then give a probability to each tokenization (which is
|
||||
tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
|
||||
of the tokenization according to their probabilities).
|
||||
|
||||
Those probabilities define the loss that trains the tokenizer: if our corpus consists of the
|
||||
words :math:`x_{1}, \dots, x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible
|
||||
tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as
|
||||
Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words :math:`x_{1}, \dots,
|
||||
x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible tokenizations of
|
||||
:math:`x_{i}` (with the current vocabulary), then the loss is defined as
|
||||
|
||||
.. math::
|
||||
\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
|
||||
@@ -236,8 +234,8 @@ SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`
|
||||
includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
|
||||
|
||||
That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
|
||||
the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate
|
||||
all of them together and replace '▁' with space.
|
||||
the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate all
|
||||
of them together and replace '▁' with space.
|
||||
|
||||
All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
|
||||
:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
|
||||
|
||||
Reference in New Issue
Block a user