diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst
index 5d00021eda..2e8e43e563 100644
--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -226,7 +226,7 @@ Contrary to RNNs that have the position of each token embedded within them, tran
 each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
 the list of tokens.
 
-They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
+They are an optional parameter. If no ``position_ids`` are passed to the model, the IDs are automatically created as
 absolute positional embeddings.
 
 Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
diff --git a/docs/source/model_summary.rst b/docs/source/model_summary.rst
index 6f66958737..5567c49e29 100644
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -16,7 +16,7 @@ Summary of the models
 This is a summary of the models available in 🤗 Transformers. It assumes you’re familiar with the original `transformer
 model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
 <http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
-models. You can check them more in detail in their respective documentation. Also checkout the :doc:`pretrained model
+models. You can check them in more detail in their respective documentation. Also check out the :doc:`pretrained model
 page </pretrained_models>` to see the checkpoints available for each type of model and all `the community models
 <https://huggingface.co/models>`_.
 
@@ -30,7 +30,7 @@ Each one of the models in the library falls into one of the following categories
 
 Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
 previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
-sentence so that the attention heads can only see what was before in the next, and not what’s after. Although those
+sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
 models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A
 typical example of such models is GPT.
 
@@ -512,8 +512,8 @@ BART
 <https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
 
 Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
-fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder
-, on the pretraining tasks, a composition of the following transformations are applied:
+fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). A composition of
+the following transformations is applied on the pretraining tasks for the encoder:
 
   * mask random tokens (like in BERT)
   * delete random tokens
diff --git a/docs/source/philosophy.rst b/docs/source/philosophy.rst
index 8b06c981d5..644ef51c6b 100644
--- a/docs/source/philosophy.rst
+++ b/docs/source/philosophy.rst
@@ -78,7 +78,7 @@ The library is built around three types of classes for each model:
 All these classes can be instantiated from pretrained instances and saved locally using two methods:
 
 - :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
-  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>` or
+  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`) or
   stored locally (or on a server) by the user,
 - :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
   :obj:`from_pretrained()`.
diff --git a/docs/source/preprocessing.rst b/docs/source/preprocessing.rst
index 5313ff1262..773f84783d 100644
--- a/docs/source/preprocessing.rst
+++ b/docs/source/preprocessing.rst
@@ -17,10 +17,10 @@ In this tutorial, we'll explore how to preprocess your data using 🤗 Transform
 call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
 you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
 
-As we saw in the :doc:`quicktour </quicktour>`, the tokenizer will first split a given text in words (or part of words,
-punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able to
-build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect to
-work properly.
+As we saw in the :doc:`quick tour </quicktour>`, the tokenizer will first split a given text into words (or parts of
+words, punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able
+to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
+to work properly.
 
 .. note::
 
@@ -131,7 +131,7 @@ ones it should not (because they represent padding in this case).
 
 
 Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
-can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer to throw those kinds of warnings.
+can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer from throwing those kinds of warnings.
 
 .. _sentence-pairs:
 
@@ -216,7 +216,6 @@ Everything you always wanted to know about padding and truncation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
-
 truncate to the maximum length the mode can accept). However, the API supports more strategies if you need them. The
 three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
 
diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst
index fccf181646..51d962b79b 100644
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -158,7 +158,7 @@ Using the tokenizer
 
 We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
 words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
-that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`, which is why we need
+that process (you can learn more about them in the :doc:`tokenizer summary <tokenizer_summary>`), which is why we need
 to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
 pretrained.
 
diff --git a/docs/source/task_summary.rst b/docs/source/task_summary.rst
index 36c8a2de7a..94cc615609 100644
--- a/docs/source/task_summary.rst
+++ b/docs/source/task_summary.rst
@@ -327,7 +327,7 @@ Masked Language Modeling
 Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
 fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
 right of the mask) and the left context (tokens on the left of the mask). Such a training creates a strong basis for
-downstream tasks, requiring bi-directional context such as SQuAD (question answering, see `Lewis, Lui, Goyal et al.
+downstream tasks requiring bi-directional context, such as SQuAD (question answering, see `Lewis, Liu, Goyal et al.
 <https://arxiv.org/abs/1910.13461>`__, part 4.2).
 
 Here is an example of using pipelines to replace a mask from a sequence:
@@ -657,7 +657,7 @@ Here are the expected results:
         {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
     ]
 
-Note, how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
+Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City",
 "DUMBO" and "Manhattan Bridge" have been identified as locations.
 
 Here is an example of doing named entity recognition, using a model and a tokenizer. The process is the following:
diff --git a/docs/source/tokenizer_summary.rst b/docs/source/tokenizer_summary.rst
index aaaff9fff1..44f0d86e6c 100644
--- a/docs/source/tokenizer_summary.rst
+++ b/docs/source/tokenizer_summary.rst
@@ -18,7 +18,7 @@ On this page, we will have a closer look at tokenization. As we saw in :doc:`the
 look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
 text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
 tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
-and :ref:`SentencePiece <sentencepiece>`, and show exemplary which tokenizer type is used by which model.
+and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.
 
 Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
 type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
@@ -72,7 +72,7 @@ greater than 50,000, especially if they are pretrained only on a single language
 So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
 character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
 the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
-for the letter ``"t"`` is much harder as learning a context-independent representation for the word ``"today"``.
+for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
 Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
 transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
 
@@ -202,10 +202,10 @@ WordPiece
 
 WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
 <model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
-Voice Seach (Schuster et al., 2012)
+Voice Search (Schuster et al., 2012)
 <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
 BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
-progressively learn a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
+progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
 symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
 
 So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
diff --git a/docs/source/training.rst b/docs/source/training.rst
index 5cabfeca38..7daaaaa99a 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -14,7 +14,7 @@ Training and fine-tuning
 =======================================================================================================================
 
 Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
-seemlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
+seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
 standard training tools available in either framework. We will also show how to use our included
 :func:`~transformers.Trainer` class which handles much of the complexity of training for you.