diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst
index 5d00021eda..2e8e43e563 100644
--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -226,7 +226,7 @@ Contrary to RNNs that have the position of each token embedded within them, tran
 each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
 the list of tokens.
 
-They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
+They are an optional parameter. If no ``position_ids`` are passed to the model, the IDs are automatically created as
 absolute positional embeddings.
 
 Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
diff --git a/docs/source/model_summary.rst b/docs/source/model_summary.rst
index 6f66958737..5567c49e29 100644
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -16,7 +16,7 @@ Summary of the models
 This is a summary of the models available in 🤗 Transformers. It assumes you’re familiar with the original `transformer
 model `_. For a gentle introduction check the `annotated transformer
 `_. Here we focus on the high-level differences between the
-models. You can check them more in detail in their respective documentation. Also checkout the :doc:`pretrained model
+models. You can check them more in detail in their respective documentation. Also check out the :doc:`pretrained model
 page ` to see the checkpoints available for each type of model and all `the community models
 `_.
 
@@ -30,7 +30,7 @@ Each one of the models in the library falls into one of the following categories
 
 Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
 previous ones.
They correspond to the decoder of the original transformer model, and a mask is used on top of the full
-sentence so that the attention heads can only see what was before in the next, and not what’s after. Although those
+sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
 models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
 typical example of such models is GPT.
 
@@ -512,8 +512,8 @@ BART
 `_, Mike Lewis et al.
 
 Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
-fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder
-, on the pretraining tasks, a composition of the following transformations are applied:
+fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). A composition of
+the following transformations are applied on the pretraining tasks for the encoder:
 
 * mask random tokens (like in BERT)
 * delete random tokens
diff --git a/docs/source/philosophy.rst b/docs/source/philosophy.rst
index 8b06c981d5..644ef51c6b 100644
--- a/docs/source/philosophy.rst
+++ b/docs/source/philosophy.rst
@@ -78,7 +78,7 @@ The library is built around three types of classes for each model:
 All these classes can be instantiated from pretrained instances and saved locally using two methods:
 
 - :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
-  provided by the library itself (the supported models are provided in the list :doc:`here ` or
+  provided by the library itself (the supported models are provided in the list :doc:`here `) or
   stored locally (or on a server) by the user,
 - :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
   :obj:`from_pretrained()`.
diff --git a/docs/source/preprocessing.rst b/docs/source/preprocessing.rst
index 5313ff1262..773f84783d 100644
--- a/docs/source/preprocessing.rst
+++ b/docs/source/preprocessing.rst
@@ -17,10 +17,10 @@ In this tutorial, we'll explore how to preprocess your data using 🤗 Transform
 call a :doc:`tokenizer `. You can build one using the tokenizer class associated to the model
 you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
 
-As we saw in the :doc:`quicktour `, the tokenizer will first split a given text in words (or part of words,
-punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able to
-build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect to
-work properly.
+As we saw in the :doc:`quick tour `, the tokenizer will first split a given text in words (or part of
+words, punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able
+to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
+to work properly.
 
 .. note::
 
@@ -131,7 +131,7 @@ ones it should not (because they represent padding in this case).
 
 
 Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
-can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer to throw those kinds of warnings.
+can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer from throwing those kinds of warnings.
 
 .. _sentence-pairs:
 
@@ -216,7 +216,6 @@ Everything you always wanted to know about padding and truncation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
-
 truncate to the maximum length the mode can accept). However, the API supports more strategies if you need them. The
 three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
 
diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst
index fccf181646..51d962b79b 100644
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -158,7 +158,7 @@ Using the tokenizer
 
 We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
 words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
-that process (you can learn more about them in the :doc:`tokenizer summary `, which is why we need
+that process (you can learn more about them in the :doc:`tokenizer summary `), which is why we need
 to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
 pretrained.
 
diff --git a/docs/source/task_summary.rst b/docs/source/task_summary.rst
index 36c8a2de7a..94cc615609 100644
--- a/docs/source/task_summary.rst
+++ b/docs/source/task_summary.rst
@@ -327,7 +327,7 @@ Masked Language Modeling
 Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
 fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
 right of the mask) and the left context (tokens on the left of the mask).
Such a training creates a strong basis for
-downstream tasks, requiring bi-directional context such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
+downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
 `__, part 4.2).
 
 Here is an example of using pipelines to replace a mask from a sequence:
@@ -657,7 +657,7 @@ Here are the expected results:
         {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
     ]
 
-Note, how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
+Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
 "DUMBO" and "Manhattan Bridge" have been identified as locations.
 
 Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:
diff --git a/docs/source/tokenizer_summary.rst b/docs/source/tokenizer_summary.rst
index aaaff9fff1..44f0d86e6c 100644
--- a/docs/source/tokenizer_summary.rst
+++ b/docs/source/tokenizer_summary.rst
@@ -18,7 +18,7 @@ On this page, we will have a closer look at tokenization. As we saw in :doc:`the
 look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
 text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
 tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) `, :ref:`WordPiece `,
-and :ref:`SentencePiece `, and show exemplary which tokenizer type is used by which model.
+and :ref:`SentencePiece `, and show examples of which tokenizer type is used by which model.
 
 Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
 type was used by the pretrained model.
For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
@@ -72,7 +72,7 @@ greater than 50,000, especially if they are pretrained only on a single language
 So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
 character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
 the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
-for the letter ``"t"`` is much harder as learning a context-independent representation for the word ``"today"``.
+for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
 Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
 transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
 
@@ -202,10 +202,10 @@ WordPiece
 
 WordPiece is the subword tokenization algorithm used for :doc:`BERT `, :doc:`DistilBERT
 `, and :doc:`Electra `. The algorithm was outlined in `Japanese and Korean
-Voice Seach (Schuster et al., 2012)
+Voice Search (Schuster et al., 2012)
 `__ and is very similar to
 BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
-progressively learn a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
+progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
 symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
 
 So what does this mean exactly?
Referring to the previous example, maximizing the likelihood of the training data is
diff --git a/docs/source/training.rst b/docs/source/training.rst
index 5cabfeca38..7daaaaa99a 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -14,7 +14,7 @@ Training and fine-tuning
 =======================================================================================================================
 
 Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
-seemlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
+seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
 standard training tools available in either framework. We will also show how to use our included
 :func:`~transformers.Trainer` class which handles much of the complexity of training for you.