Tokenizer summary (#5467)

* Work on tokenizer summary

* Finish tutorial

* Link to it

* Apply suggestions from code review

Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Add vocab definition

Co-authored-by: Anthony MOI <xn1t0x@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Sylvain Gugger
2020-07-02 17:07:42 -04:00
committed by GitHub
parent ef0e9d806c
commit 6b735a7253
3 changed files with 247 additions and 2 deletions

@@ -146,8 +146,9 @@ Using the tokenizer
 We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
 words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
-that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the
-same rules as when the model was pretrained.
+that process (you can learn more about them in the :doc:`tokenizer_summary <tokenizer_summary>`, which is why we need
+to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
+pretrained.
 The second step is to convert those `tokens` into numbers, to be able to build a tensor out of them and feed them to
 the model. To do this, the tokenizer has a `vocab`, which is the part we download when we instantiate it with the