updating docs - adding few tests to tokenizers

2019-08-04 22:42:55 +02:00
parent 009273dbdd
commit 00132b7a7a
10 changed files with 390 additions and 521 deletions
--- a/docs/source/main_classes/tokenizer.rst
+++ b/docs/source/main_classes/tokenizer.rst
@@ -1,6 +1,14 @@
 Tokenizer
 ----------------------------------------------------

+The base class ``PreTrainedTokenizer`` implements the common methods for loading/saving a tokenizer either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace's AWS S3 repository).
+
+``PreTrainedTokenizer`` is the main entry point into tokenizers as it also implements the main methods for using all the tokenizers:
+
+- tokenizing, converting tokens to ids and back, and encoding/decoding,
+- adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...),
+- managing special tokens (adding them, assigning them to roles, making sure they are not split during tokenization).
+
 ``PreTrainedTokenizer``
 ~~~~~~~~~~~~~~~~~~~~~~~~
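
The three responsibilities listed in the added documentation (tokenizing and converting to ids, adding tokens independently of the underlying algorithm, and keeping added tokens from being split) can be illustrated with a small toy sketch. This is not the library's implementation: `ToyTokenizer` and its `_base_tokenize` helper are hypothetical, the regex word splitter merely stands in for BPE or SentencePiece, and the method names only mirror those of ``PreTrainedTokenizer`` for readability.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in, NOT the library's PreTrainedTokenizer."""

    def __init__(self, vocab, unk_token="<unk>"):
        # Assumes `vocab` holds unique tokens and does not contain unk_token.
        self.unk_token = unk_token
        self.vocab = {tok: i for i, tok in enumerate([unk_token] + list(vocab))}
        self.added_tokens = {}  # tokens added later, kept apart from the base vocab

    def __len__(self):
        return len(self.vocab) + len(self.added_tokens)

    def add_tokens(self, new_tokens):
        """Append unseen tokens after the existing ids; return how many were added."""
        added = 0
        for tok in new_tokens:
            if tok not in self.vocab and tok not in self.added_tokens:
                self.added_tokens[tok] = len(self)
                added += 1
        return added

    def _base_tokenize(self, text):
        # Placeholder for the real algorithm: words and single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)

    def tokenize(self, text):
        if not self.added_tokens:
            return self._base_tokenize(text)
        # Carve out added tokens first (longest first) so they are never split.
        ordered = sorted(self.added_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
        tokens = []
        for piece in re.split(pattern, text):
            if piece in self.added_tokens:
                tokens.append(piece)
            else:
                tokens.extend(self._base_tokenize(piece))
        return tokens

    def convert_tokens_to_ids(self, tokens):
        unk_id = self.vocab[self.unk_token]
        return [self.added_tokens.get(t, self.vocab.get(t, unk_id)) for t in tokens]

    def convert_ids_to_tokens(self, ids):
        id_to_token = {i: t for t, i in {**self.vocab, **self.added_tokens}.items()}
        return [id_to_token[i] for i in ids]

    def encode(self, text):
        return self.convert_tokens_to_ids(self.tokenize(text))

    def decode(self, ids):
        return " ".join(self.convert_ids_to_tokens(ids))


tok = ToyTokenizer(["hello", "world"])
before = tok.tokenize("hello [SEP] world")  # "[SEP]" is split into pieces
num_added = tok.add_tokens(["[SEP]"])
after = tok.tokenize("hello [SEP] world")   # "[SEP]" now survives intact
ids = tok.encode("hello [SEP] world")
```

Keeping added tokens in a separate mapping and splitting the text on them before the base tokenizer runs is one simple way to make the addition independent of the underlying algorithm; the real class may handle this differently.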