diff --git a/docs/source/index.rst b/docs/source/index.rst
index 1c70c98584..4051ecbc8b 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -44,7 +44,7 @@ The documentation is organized in five parts:
   and a glossary.
 - **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
 - **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
-- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general resarch in
+- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
   transformers model
 - The three last section contain the documentation of each public class and function, grouped in:
 
diff --git a/docs/source/model_doc/dpr.rst b/docs/source/model_doc/dpr.rst
index 86a60ff15d..6ab5c697dd 100644
--- a/docs/source/model_doc/dpr.rst
+++ b/docs/source/model_doc/dpr.rst
@@ -5,7 +5,7 @@ Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
-intorduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
+introduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
 Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
 
 The abstract from the paper is the following:
diff --git a/docs/source/model_summary.rst b/docs/source/model_summary.rst
index ea36587d10..d00193bab9 100644
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -530,7 +530,7 @@ Sequence-to-sequence model with the same encoder-decoder model architecture as B
 two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
 objective, called Gap Sentence Generation (GSG).
 
-  * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in
+  * MLM: encoder input tokens are randomly replaced by mask tokens and have to be predicted by the encoder (like in
     BERT)
   * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a
     causal mask to hide the future words like a regular auto-regressive transformer decoder.
diff --git a/docs/source/multilingual.rst b/docs/source/multilingual.rst
index 964cf5b373..a9d156de40 100644
--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
@@ -109,7 +109,7 @@ XLM-RoBERTa
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
-over previously released multi-lingual models like mBERT or XLM on downstream taks like classification, sequence
+over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
 labeling and question answering.
 
 Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
diff --git a/docs/source/perplexity.rst b/docs/source/perplexity.rst
index 910da6d444..5bd89341de 100644
--- a/docs/source/perplexity.rst
+++ b/docs/source/perplexity.rst
@@ -62,7 +62,7 @@ sliding the context window so that the model has more context when making each p
 This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
 favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
 practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
-1 token a time. This allows computation to procede much faster while still giving the model a large context to make
+1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
 predictions at each step.
 
 Example: Calculating perplexity with GPT-2 in 🤗 Transformers
diff --git a/examples/benchmarking/plot_csv_file.py b/examples/benchmarking/plot_csv_file.py
index 7d3f1fd4b8..6614df0a98 100644
--- a/examples/benchmarking/plot_csv_file.py
+++ b/examples/benchmarking/plot_csv_file.py
@@ -25,7 +25,7 @@ class PlotArguments:
     )
     plot_along_batch: bool = field(
         default=False,
-        metadata={"help": "Whether to plot along batch size or sequence lengh. Defaults to sequence length."},
+        metadata={"help": "Whether to plot along batch size or sequence length. Defaults to sequence length."},
     )
     is_time: bool = field(
         default=False,
diff --git a/examples/distillation/README.md b/examples/distillation/README.md
index 272b8f8697..766ce217ac 100644
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -17,7 +17,7 @@ This folder contains the original code used to train Distil* as well as examples
 
 ## What is Distil*
 
-Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
+Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scale trained Transformer models into production.
 
 We have applied the same method to other Transformer architectures and released the weights:
 - GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 16.3 compared to 21.1 for **DistilGPT2** (after fine-tuning on the train set).
@@ -57,7 +57,7 @@ Here are the results on the *test* sets for 6 of the languages available in XNLI
 
 This part of the library has only be tested with Python3.6+. There are few specific dependencies to install before launching a distillation, you can install them with the command `pip install -r requirements.txt`.
 
-**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breakings changes compared to v1.1.0).
+**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0).
 
 
 ## How to use DistilBERT
@@ -111,7 +111,7 @@ python scripts/binarized_data.py \
     --dump_file data/binarized_text
 ```
 
-Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smoothes the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurrences of each tokens in the data:
+Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smooths the probability of masking with a factor that puts more emphasis on rare words. Thus we count the occurrences of each token in the data:
 
 ```bash
 python scripts/token_counts.py \
@@ -173,7 +173,7 @@ python -m torch.distributed.launch \
         --token_counts data/token_counts.bert-base-uncased.pickle
 ```
 
-**Tips:** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use `--student_pretrained_weights` argument to use this initialization for the distilled training!
+**Tips:** Starting distilled training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use `--student_pretrained_weights` argument to use this initialization for the distilled training!
 
 Happy distillation!
 
diff --git a/examples/distillation/distiller.py b/examples/distillation/distiller.py
index d724ac6e29..95e6ac0bbc 100644
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -188,7 +188,7 @@ class Distiller:
 
     def prepare_batch_mlm(self, batch):
         """
-        Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the masked label for MLM.
+        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the masked label for MLM.
 
         Input:
         ------
@@ -200,7 +200,7 @@ class Distiller:
         -------
             token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
             attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
-            mlm_labels: `torch.tensor(bs, seq_length)` - The masked languge modeling labels. There is a -100 where there is nothing to predict.
+            mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels. There is a -100 where there is nothing to predict.
         """
         token_ids, lengths = batch
         token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
@@ -253,7 +253,7 @@ class Distiller:
 
     def prepare_batch_clm(self, batch):
         """
-        Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the labels for CLM.
+        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the labels for CLM.
 
         Input:
         ------
diff --git a/examples/distillation/scripts/extract_distilbert.py b/examples/distillation/scripts/extract_distilbert.py
index 15b48802fb..e125f36187 100644
--- a/examples/distillation/scripts/extract_distilbert.py
+++ b/examples/distillation/scripts/extract_distilbert.py
@@ -86,7 +86,7 @@ if __name__ == "__main__":
             compressed_sd[f"vocab_layer_norm.{w}"] = state_dict[f"cls.predictions.transform.LayerNorm.{w}"]
 
     print(f"N layers selected for distillation: {std_idx}")
-    print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
+    print(f"Number of params transferred for distillation: {len(compressed_sd.keys())}")
 
-    print(f"Save transfered checkpoint to {args.dump_checkpoint}.")
+    print(f"Save transferred checkpoint to {args.dump_checkpoint}.")
     torch.save(compressed_sd, args.dump_checkpoint)
diff --git a/examples/movement-pruning/README.md b/examples/movement-pruning/README.md
index 322f2b1bf9..fd6c0085e3 100644
--- a/examples/movement-pruning/README.md
+++ b/examples/movement-pruning/README.md
@@ -21,7 +21,7 @@ You can also have a look at this fun *Explain Like I'm Five* introductory [slide
 
 One promise of extreme pruning is to obtain extremely small models that can be easily sent (and stored) on edge devices. By setting weights to 0., we reduce the amount of information we need to store, and thus decreasing the memory size. We are able to obtain extremely sparse fine-pruned models with movement pruning: ~95% of the dense performance with ~5% of total remaining weights in the BERT encoder.
 
-In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the orignal dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
+In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the original dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
 
 While movement pruning does not directly optimize for memory footprint (but rather the number of non-null weights), we hypothetize that further memory compression ratios can be achieved with specific quantization aware trainings (see for instance [Q8BERT](https://arxiv.org/abs/1910.06188), [And the Bit Goes Down](https://arxiv.org/abs/1907.05686) or [Quant-Noise](https://arxiv.org/abs/2004.07320)).
 
diff --git a/examples/movement-pruning/emmental/modules/binarizer.py b/examples/movement-pruning/emmental/modules/binarizer.py
index f6c6a732c4..b4a801d56d 100644
--- a/examples/movement-pruning/emmental/modules/binarizer.py
+++ b/examples/movement-pruning/emmental/modules/binarizer.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Binarizers take a (real value) matrice as input and produce a binary (values in {0,1}) mask of the same shape.
+Binarizers take a (real value) matrix as input and produce a binary (values in {0,1}) mask of the same shape.
 """
 
 import torch
diff --git a/model_cards/KB/albert-base-swedish-cased-alpha/README.md b/model_cards/KB/albert-base-swedish-cased-alpha/README.md
index ca87015b84..46bf3c700f 100644
--- a/model_cards/KB/albert-base-swedish-cased-alpha/README.md
+++ b/model_cards/KB/albert-base-swedish-cased-alpha/README.md
@@ -4,7 +4,7 @@ language: sv
 
 # Swedish BERT Models
 
-The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on aproximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
 
 The following three models are currently available:
 
@@ -86,7 +86,7 @@ for token in nlp(text):
 print(l)
 ```
 
-Which should result in the following (though less cleanly formated):
+Which should result in the following (though less cleanly formatted):
 
 ```python
 [ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
@@ -104,7 +104,7 @@ Which should result in the following (though less cleanly formated):
 
 ### ALBERT base
 
-The easisest way to do this is, again, using Huggingface Transformers:
+The easiest way to do this is, again, using Huggingface Transformers:
 
 ```python
 from transformers import AutoModel,AutoTokenizer
diff --git a/model_cards/KB/bert-base-swedish-cased-ner/README.md b/model_cards/KB/bert-base-swedish-cased-ner/README.md
index ca87015b84..46bf3c700f 100644
--- a/model_cards/KB/bert-base-swedish-cased-ner/README.md
+++ b/model_cards/KB/bert-base-swedish-cased-ner/README.md
@@ -4,7 +4,7 @@ language: sv
 
 # Swedish BERT Models
 
-The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on aproximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
 
 The following three models are currently available:
 
@@ -86,7 +86,7 @@ for token in nlp(text):
 print(l)
 ```
 
-Which should result in the following (though less cleanly formated):
+Which should result in the following (though less cleanly formatted):
 
 ```python
 [ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
@@ -104,7 +104,7 @@ Which should result in the following (though less cleanly formated):
 
 ### ALBERT base
 
-The easisest way to do this is, again, using Huggingface Transformers:
+The easiest way to do this is, again, using Huggingface Transformers:
 
 ```python
 from transformers import AutoModel,AutoTokenizer
diff --git a/model_cards/elgeish/cs224n-squad2.0-albert-base-v2/README.md b/model_cards/elgeish/cs224n-squad2.0-albert-base-v2/README.md
index aff625e4ac..4e59ee06c4 100644
--- a/model_cards/elgeish/cs224n-squad2.0-albert-base-v2/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-albert-base-v2/README.md
@@ -4,7 +4,7 @@ tags:
 ---
 
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,
diff --git a/model_cards/elgeish/cs224n-squad2.0-albert-xxlarge-v1/README.md b/model_cards/elgeish/cs224n-squad2.0-albert-xxlarge-v1/README.md
index 8b5d33fe63..8856e7cc80 100644
--- a/model_cards/elgeish/cs224n-squad2.0-albert-xxlarge-v1/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-albert-xxlarge-v1/README.md
@@ -4,7 +4,7 @@ tags:
 ---
 
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,
diff --git a/model_cards/elgeish/cs224n-squad2.0-distilbert-base-uncased/README.md b/model_cards/elgeish/cs224n-squad2.0-distilbert-base-uncased/README.md
index b4cea1eb98..1bf3ab0781 100644
--- a/model_cards/elgeish/cs224n-squad2.0-distilbert-base-uncased/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-distilbert-base-uncased/README.md
@@ -1,5 +1,5 @@
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,
diff --git a/model_cards/elgeish/cs224n-squad2.0-roberta-base/README.md b/model_cards/elgeish/cs224n-squad2.0-roberta-base/README.md
index 220aa23b42..1cef39c264 100644
--- a/model_cards/elgeish/cs224n-squad2.0-roberta-base/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-roberta-base/README.md
@@ -1,5 +1,5 @@
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,
diff --git a/model_cards/mrm8488/t5-base-finetuned-e2m-intent/README.md b/model_cards/mrm8488/t5-base-finetuned-e2m-intent/README.md
index a50d39b95b..8f9624f6ae 100644
--- a/model_cards/mrm8488/t5-base-finetuned-e2m-intent/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-e2m-intent/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the downstream task (Intent Prediction) - Dataset 📚 
 
-Dataset ID: ```event2Mind``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```event2Mind``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/model_cards/mrm8488/t5-base-finetuned-question-generation-ap/README.md b/model_cards/mrm8488/t5-base-finetuned-question-generation-ap/README.md
index 05530b523c..524b1ad04b 100644
--- a/model_cards/mrm8488/t5-base-finetuned-question-generation-ap/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-question-generation-ap/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
 
-Dataset ID: ```squad``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md b/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
index d842e65625..b20d8fc17d 100644
--- a/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
 
-Dataset ID: ```squad_v2``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad_v2``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/model_cards/mrm8488/t5-base-finetuned-wikiSQL-sql-to-en/README.md b/model_cards/mrm8488/t5-base-finetuned-wikiSQL-sql-to-en/README.md
index 73932f1f0b..3f012c771a 100644
--- a/model_cards/mrm8488/t5-base-finetuned-wikiSQL-sql-to-en/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-wikiSQL-sql-to-en/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the Dataset 📚 
 
-Dataset ID: ```wikisql``` from  [HugginFace/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
+Dataset ID: ```wikisql``` from  [Huggingface/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/model_cards/mrm8488/t5-base-finetuned-wikiSQL/README.md b/model_cards/mrm8488/t5-base-finetuned-wikiSQL/README.md
index 3e2b46cf6c..0241381d0e 100644
--- a/model_cards/mrm8488/t5-base-finetuned-wikiSQL/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-wikiSQL/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the Dataset 📚 
 
-Dataset ID: ```wikisql``` from  [HugginFace/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
+Dataset ID: ```wikisql``` from  [Huggingface/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/model_cards/mrm8488/t5-small-finetuned-quora-for-paraphrasing/README.md b/model_cards/mrm8488/t5-small-finetuned-quora-for-paraphrasing/README.md
index cfe222660d..e5913a9bc1 100644
--- a/model_cards/mrm8488/t5-small-finetuned-quora-for-paraphrasing/README.md
+++ b/model_cards/mrm8488/t5-small-finetuned-quora-for-paraphrasing/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the downstream task (Question Paraphrasing) - Dataset 📚❓↔️❓
 
-Dataset ID: ```quora``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```quora``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/model_cards/mrm8488/t5-small-finetuned-squadv1/README.md b/model_cards/mrm8488/t5-small-finetuned-squadv1/README.md
index fde155929f..dc6b9c77ba 100644
--- a/model_cards/mrm8488/t5-small-finetuned-squadv1/README.md
+++ b/model_cards/mrm8488/t5-small-finetuned-squadv1/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
 
-Dataset ID: ```squad``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/model_cards/mrm8488/t5-small-finetuned-squadv2/README.md b/model_cards/mrm8488/t5-small-finetuned-squadv2/README.md
index 28420c1787..1fb9db76e4 100644
--- a/model_cards/mrm8488/t5-small-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/t5-small-finetuned-squadv2/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
 
-Dataset ID: ```squad_v2``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad_v2``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/model_cards/mrm8488/t5-small-finetuned-wikiSQL/README.md b/model_cards/mrm8488/t5-small-finetuned-wikiSQL/README.md
index ebdab18dd6..40aff2e0aa 100644
--- a/model_cards/mrm8488/t5-small-finetuned-wikiSQL/README.md
+++ b/model_cards/mrm8488/t5-small-finetuned-wikiSQL/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 
 ## Details of the Dataset 📚 
 
-Dataset ID: ```wikisql``` from  [HugginFace/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
+Dataset ID: ```wikisql``` from  [Huggingface/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
 
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
diff --git a/src/transformers/modeling_tf_pytorch_utils.py b/src/transformers/modeling_tf_pytorch_utils.py
index adcc19c61b..5392d32157 100644
--- a/src/transformers/modeling_tf_pytorch_utils.py
+++ b/src/transformers/modeling_tf_pytorch_utils.py
@@ -39,7 +39,7 @@ def convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove="")
     return tuple with:
 
         - pytorch model weight name
-        - transpose: boolean indicating wether TF2.0 and PyTorch weights matrices are transposed with regards to each
+        - transpose: boolean indicating whether TF2.0 and PyTorch weight matrices are transposed with regard to each
           other
     """
     tf_name = tf_name.replace(":0", "")  # device ids
diff --git a/src/transformers/models/fsmt/modeling_fsmt.py b/src/transformers/models/fsmt/modeling_fsmt.py
index 56de8a716d..27b7737232 100644
--- a/src/transformers/models/fsmt/modeling_fsmt.py
+++ b/src/transformers/models/fsmt/modeling_fsmt.py
@@ -951,7 +951,7 @@ class FSMTModel(PretrainedFSMTModel):
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
             )
-        # If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOuput when return_dict=False
+        # If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOutput when return_dict=True
         elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
             encoder_outputs = BaseModelOutput(
                 last_hidden_state=encoder_outputs[0],
diff --git a/src/transformers/models/t5/modeling_tf_t5.py b/src/transformers/models/t5/modeling_tf_t5.py
index 4d721a531d..d92245ff73 100644
--- a/src/transformers/models/t5/modeling_tf_t5.py
+++ b/src/transformers/models/t5/modeling_tf_t5.py
@@ -642,7 +642,7 @@ class TFT5MainLayer(tf.keras.layers.Layer):
             raise ValueError(f"You have to specify either {err_msg_prefix}inputs or {err_msg_prefix}inputs_embeds")
 
         if inputs_embeds is None:
-            assert self.embed_tokens is not None, "You have to intialize the model with valid token embeddings"
+            assert self.embed_tokens is not None, "You have to initialize the model with valid token embeddings"
             inputs_embeds = self.embed_tokens(input_ids)
 
         batch_size, seq_length = input_shape
diff --git a/src/transformers/models/transfo_xl/modeling_transfo_xl.py b/src/transformers/models/transfo_xl/modeling_transfo_xl.py
index f231e5e0c7..63ab53e07e 100644
--- a/src/transformers/models/transfo_xl/modeling_transfo_xl.py
+++ b/src/transformers/models/transfo_xl/modeling_transfo_xl.py
@@ -667,9 +667,9 @@ class TransfoXLLMHeadModelOutput(ModelOutput):
 
     @property
     def logits(self):
-        # prediciton scores are the output of the adaptive softmax, see
+        # prediction scores are the output of the adaptive softmax, see
         # the file `modeling_transfo_xl_utilities`. Since the adaptive
-        # softmax returns the log softmax value, `self.prediciton_scores`
+        # softmax returns the log softmax value, `self.prediction_scores`
         # are strictly speaking not exactly `logits`, but behave the same
         # way logits do.
         return self.prediction_scores
@@ -886,7 +886,7 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
                 head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)
             head_mask = head_mask.to(
                 dtype=next(self.parameters()).dtype
-            )  # switch to fload if need + fp16 compatibility
+            )  # switch to float if needed + fp16 compatibility
         else:
             head_mask = [None] * self.n_layer
 
diff --git a/src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py b/src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py
index aee3c62948..98692746e7 100644
--- a/src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py
+++ b/src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py
@@ -91,8 +91,8 @@ class ProjectedAdaptiveLogSoftmax(nn.Module):
 
         Return:
             if labels is None: out :: [len*bsz x n_tokens] log probabilities of tokens over the vocabulary else: out ::
-            [(len-1)*bsz] Negative log likelihood We could replace this implementation by the native PyTorch one if
-            their's had an option to set bias on all clusters in the native one. here:
+            [(len-1)*bsz] Negative log likelihood. We could replace this implementation by the native PyTorch one if
+            theirs had an option to set bias on all clusters in the native one. here:
             https://github.com/pytorch/pytorch/blob/dbe6a7a9ff1a364a8706bf5df58a1ca96d2fd9da/torch/nn/modules/adaptive.py#L138
         """
 
diff --git a/src/transformers/models/xlm/modeling_tf_xlm.py b/src/transformers/models/xlm/modeling_tf_xlm.py
index 2ad636b2ce..4d08337f14 100644
--- a/src/transformers/models/xlm/modeling_tf_xlm.py
+++ b/src/transformers/models/xlm/modeling_tf_xlm.py
@@ -633,11 +633,11 @@ XLM_INPUTS_DOCSTRING = r"""
             A parallel sequence of tokens to be used to indicate the language of each token in the input. Indices are
             languages ids which can be obtained from the language names by using two conversion mappings provided in
             the configuration of the model (only provided for multilingual models). More precisely, the `language name
-            to language id` mapping is in :obj:`model.config.lang2id` (which is a dictionary strring to int) and the
+            to language id` mapping is in :obj:`model.config.lang2id` (which is a dictionary string to int) and the
             `language id to language name` mapping is in :obj:`model.config.id2lang` (dictionary int to string).
 
             See usage examples detailed in the :doc:`multilingual documentation <../multilingual>`.
-        ttoken_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
+        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
             1]``:
 
diff --git a/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py b/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
index 708522fe74..02854c0c3e 100644
--- a/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
+++ b/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
@@ -54,7 +54,7 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 
 class XLMRobertaTokenizer(PreTrainedTokenizer):
     """
-    Adapted from :class:`~transfomers.RobertaTokenizer` and class:`~transfomers.XLNetTokenizer`. Based on
+    Adapted from :class:`~transformers.RobertaTokenizer` and :class:`~transformers.XLNetTokenizer`. Based on
     `SentencePiece <https://github.com/google/sentencepiece>`__.
 
     This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
diff --git a/src/transformers/models/xlnet/modeling_xlnet.py b/src/transformers/models/xlnet/modeling_xlnet.py
index f526d55373..0144cf65ff 100755
--- a/src/transformers/models/xlnet/modeling_xlnet.py
+++ b/src/transformers/models/xlnet/modeling_xlnet.py
@@ -904,7 +904,7 @@ XLNET_INPUTS_DOCSTRING = r"""
             Mask values selected in ``[0, 1]``:
 
             - 1 for tokens that are **masked**,
-            - 0 for tokens that are **not maked**.
+            - 0 for tokens that are **not masked**.
 
             You can only uses one of :obj:`input_mask` and :obj:`attention_mask`.
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
@@ -1211,7 +1211,7 @@ class XLNetModel(XLNetPreTrainedModel):
                 head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)
             head_mask = head_mask.to(
                 dtype=next(self.parameters()).dtype
-            )  # switch to fload if need + fp16 compatibility
+            )  # switch to float if needed + fp16 compatibility
         else:
             head_mask = [None] * self.n_layer
 
diff --git a/src/transformers/optimization_tf.py b/src/transformers/optimization_tf.py
index 370c10077e..f6d376fd09 100644
--- a/src/transformers/optimization_tf.py
+++ b/src/transformers/optimization_tf.py
@@ -167,9 +167,9 @@ class AdamWeightDecay(tf.keras.optimizers.Adam):
         beta_2 (:obj:`float`, `optional`, defaults to 0.999):
             The beta2 parameter in Adam, which is the exponential decay rate for the 2nd momentum estimates.
         epsilon (:obj:`float`, `optional`, defaults to 1e-7):
-            The epsilon paramenter in Adam, which is a small constant for numerical stability.
+            The epsilon parameter in Adam, which is a small constant for numerical stability.
         amsgrad (:obj:`bool`, `optional`, default to `False`):
-            Whether to apply AMSGrad varient of this algorithm or not, see `On the Convergence of Adam and Beyond
+            Whether to apply AMSGrad variant of this algorithm or not, see `On the Convergence of Adam and Beyond
             <https://arxiv.org/abs/1904.09237>`__.
         weight_decay_rate (:obj:`float`, `optional`, defaults to 0):
             The weight decay to apply.