From 1a113fcf65c9231d3aece817960fbfe521e7e275 Mon Sep 17 00:00:00 2001
From: Belladore <135602125+belladoreai@users.noreply.github.com>
Date: Thu, 15 Jun 2023 18:31:47 +0300
Subject: [PATCH] Update tokenizer_summary.mdx (grammar) (#24286)

---
 docs/source/en/tokenizer_summary.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/tokenizer_summary.mdx b/docs/source/en/tokenizer_summary.mdx
index 942fe27906..61099edabe 100644
--- a/docs/source/en/tokenizer_summary.mdx
+++ b/docs/source/en/tokenizer_summary.mdx
@@ -141,7 +141,7 @@ words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](mode
 [FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses
 Spacy and ftfy, to count the frequency of each word in the training corpus.
 
-After pre-tokenization, a set of unique words has been created and the frequency of each word it occurred in the
+After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
 training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
 of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until
 the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to