Fix many typos (#8708)

2020-11-22 00:58:10 -03:00
parent 9c0afdaf7b
commit e1f3156b21
35 changed files with 51 additions and 51 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -44,7 +44,7 @@ The documentation is organized in five parts:
  and a glossary.
 - **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
 - **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general resarch in
+- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
  transformers model
 - The three last section contain the documentation of each public class and function, grouped in:
--- a/docs/source/model_doc/dpr.rst
+++ b/docs/source/model_doc/dpr.rst
@@ -5,7 +5,7 @@ Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
-intorduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
+introduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
 Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
 The abstract from the paper is the following:
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -530,7 +530,7 @@ Sequence-to-sequence model with the same encoder-decoder model architecture as B
 two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
 objective, called Gap Sentence Generation (GSG).
-  * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in
+  * MLM: encoder input tokens are randomly replaced by mask tokens and have to be predicted by the encoder (like in
    BERT)
  * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a
    causal mask to hide the future words like a regular auto-regressive transformer decoder.
--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
@@ -109,7 +109,7 @@ XLM-RoBERTa
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
-over previously released multi-lingual models like mBERT or XLM on downstream taks like classification, sequence
+over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
 labeling and question answering.
 Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
--- a/docs/source/perplexity.rst
+++ b/docs/source/perplexity.rst
@@ -62,7 +62,7 @@ sliding the context window so that the model has more context when making each p
 This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
 favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
 practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
-1 token a time. This allows computation to procede much faster while still giving the model a large context to make
+1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
 predictions at each step.
 Example: Calculating perplexity with GPT-2 in 🤗 Transformers
--- a/examples/benchmarking/plot_csv_file.py
+++ b/examples/benchmarking/plot_csv_file.py
@@ -25,7 +25,7 @@ class PlotArguments:
    )
    plot_along_batch: bool = field(
        default=False,
-        metadata={"help": "Whether to plot along batch size or sequence lengh. Defaults to sequence length."},
+        metadata={"help": "Whether to plot along batch size or sequence length. Defaults to sequence length."},
    )
    is_time: bool = field(
        default=False,
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -17,7 +17,7 @@ This folder contains the original code used to train Distil* as well as examples
 ## What is Distil*
-Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
+Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting large-scale trained Transformer models into production.
 We have applied the same method to other Transformer architectures and released the weights:
 - GPT2: on the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 16.3 compared to 21.1 for **DistilGPT2** (after fine-tuning on the train set).
@@ -57,7 +57,7 @@ Here are the results on the *test* sets for 6 of the languages available in XNLI
 This part of the library has only be tested with Python3.6+. There are few specific dependencies to install before launching a distillation, you can install them with the command `pip install -r requirements.txt`.
-**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breakings changes compared to v1.1.0).
+**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0).
 ## How to use DistilBERT
@@ -111,7 +111,7 @@ python scripts/binarized_data.py \
    --dump_file data/binarized_text
 ```
-Our implementation of masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s one and smoothes the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurrences of each tokens in the data:
+Our implementation of masked language modeling loss follows that of [XLM](https://github.com/facebookresearch/XLM) and smooths the probability of masking with a factor that puts more emphasis on rare words. Thus we count the occurrences of each token in the data:
 ```bash
 python scripts/token_counts.py \
@@ -173,7 +173,7 @@ python -m torch.distributed.launch \
        --token_counts data/token_counts.bert-base-uncased.pickle
 ```
-**Tips:** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use `--student_pretrained_weights` argument to use this initialization for the distilled training!
+**Tips:** Starting distilled training with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint and use the `--student_pretrained_weights` argument to use this initialization for the distilled training!
 Happy distillation!
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -188,7 +188,7 @@ class Distiller:
    def prepare_batch_mlm(self, batch):
        """
-        Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the masked label for MLM.
+        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the masked label for MLM.
        Input:
        ------
@@ -200,7 +200,7 @@ class Distiller:
        -------
            token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for MLM.
            attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
-            mlm_labels: `torch.tensor(bs, seq_length)` - The masked languge modeling labels. There is a -100 where there is nothing to predict.
+            mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels. There is a -100 where there is nothing to predict.
        """
        token_ids, lengths = batch
        token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
@@ -253,7 +253,7 @@ class Distiller:
    def prepare_batch_clm(self, batch):
        """
-        Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the labels for CLM.
+        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the labels for CLM.
        Input:
        ------
--- a/examples/distillation/scripts/extract_distilbert.py
+++ b/examples/distillation/scripts/extract_distilbert.py
@@ -86,7 +86,7 @@ if __name__ == "__main__":
            compressed_sd[f"vocab_layer_norm.{w}"] = state_dict[f"cls.predictions.transform.LayerNorm.{w}"]
    print(f"N layers selected for distillation: {std_idx}")
-    print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
+    print(f"Number of params transferred for distillation: {len(compressed_sd.keys())}")
-    print(f"Save transfered checkpoint to {args.dump_checkpoint}.")
+    print(f"Save transferred checkpoint to {args.dump_checkpoint}.")
    torch.save(compressed_sd, args.dump_checkpoint)
--- a/examples/movement-pruning/README.md
+++ b/examples/movement-pruning/README.md
@@ -21,7 +21,7 @@ You can also have a look at this fun *Explain Like I'm Five* introductory [slide
 One promise of extreme pruning is to obtain extremely small models that can be easily sent (and stored) on edge devices. By setting weights to 0., we reduce the amount of information we need to store, and thus decreasing the memory size. We are able to obtain extremely sparse fine-pruned models with movement pruning: ~95% of the dense performance with ~5% of total remaining weights in the BERT encoder.
-In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the orignal dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
+In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the original dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
 While movement pruning does not directly optimize for memory footprint (but rather the number of non-null weights), we hypothetize that further memory compression ratios can be achieved with specific quantization aware trainings (see for instance [Q8BERT](https://arxiv.org/abs/1910.06188), [And the Bit Goes Down](https://arxiv.org/abs/1907.05686) or [Quant-Noise](https://arxiv.org/abs/2004.07320)).
--- a/examples/movement-pruning/emmental/modules/binarizer.py
+++ b/examples/movement-pruning/emmental/modules/binarizer.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Binarizers take a (real value) matrice as input and produce a binary (values in {0,1}) mask of the same shape.
+Binarizers take a (real value) matrix as input and produce a binary (values in {0,1}) mask of the same shape.
 """
 import torch
--- a/model_cards/KB/albert-base-swedish-cased-alpha/README.md
+++ b/model_cards/KB/albert-base-swedish-cased-alpha/README.md
@@ -4,7 +4,7 @@ language: sv
 # Swedish BERT Models
-The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on aproximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
 The following three models are currently available:
@@ -86,7 +86,7 @@ for token in nlp(text):
 print(l)
 ```
-Which should result in the following (though less cleanly formated):
+Which should result in the following (though less cleanly formatted):
 ```python
 [ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
@@ -104,7 +104,7 @@ Which should result in the following (though less cleanly formated):
 ### ALBERT base
-The easisest way to do this is, again, using Huggingface Transformers:
+The easiest way to do this is, again, using Huggingface Transformers:
 ```python
 from transformers import AutoModel,AutoTokenizer
--- a/model_cards/KB/bert-base-swedish-cased-ner/README.md
+++ b/model_cards/KB/bert-base-swedish-cased-ner/README.md
@@ -4,7 +4,7 @@ language: sv
 # Swedish BERT Models
-The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on aproximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, swedish wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
 The following three models are currently available:
@@ -86,7 +86,7 @@ for token in nlp(text):
 print(l)
 ```
-Which should result in the following (though less cleanly formated):
+Which should result in the following (though less cleanly formatted):
 ```python
 [ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
@@ -104,7 +104,7 @@ Which should result in the following (though less cleanly formated):
 ### ALBERT base
-The easisest way to do this is, again, using Huggingface Transformers:
+The easiest way to do this is, again, using Huggingface Transformers:
 ```python
 from transformers import AutoModel,AutoTokenizer
--- a/model_cards/elgeish/cs224n-squad2.0-albert-base-v2/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-albert-base-v2/README.md
@@ -4,7 +4,7 @@ tags:
 ---
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,
--- a/model_cards/elgeish/cs224n-squad2.0-albert-xxlarge-v1/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-albert-xxlarge-v1/README.md
@@ -4,7 +4,7 @@ tags:
 ---
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,
--- a/model_cards/elgeish/cs224n-squad2.0-distilbert-base-uncased/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-distilbert-base-uncased/README.md
@@ -1,5 +1,5 @@
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,
--- a/model_cards/elgeish/cs224n-squad2.0-roberta-base/README.md
+++ b/model_cards/elgeish/cs224n-squad2.0-roberta-base/README.md
@@ -1,5 +1,5 @@
 ## CS224n SQuAD2.0 Project Dataset
-The goal of this model is to save CS224n students GPU time when establising
+The goal of this model is to save CS224n students GPU time when establishing
 baselines to beat for the [Default Final Project](http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf).
 The training set used to fine-tune this model is the same as
 the [official one](https://rajpurkar.github.io/SQuAD-explorer/); however,
--- a/model_cards/mrm8488/t5-base-finetuned-e2m-intent/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-e2m-intent/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Intent Prediction) - Dataset 📚 
-Dataset ID: ```event2Mind``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```event2Mind``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/model_cards/mrm8488/t5-base-finetuned-question-generation-ap/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-question-generation-ap/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
-Dataset ID: ```squad``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
-Dataset ID: ```squad_v2``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad_v2``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/model_cards/mrm8488/t5-base-finetuned-wikiSQL-sql-to-en/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-wikiSQL-sql-to-en/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the Dataset 📚 
-Dataset ID: ```wikisql``` from  [HugginFace/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
+Dataset ID: ```wikisql``` from  [Huggingface/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/model_cards/mrm8488/t5-base-finetuned-wikiSQL/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-wikiSQL/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the Dataset 📚 
-Dataset ID: ```wikisql``` from  [HugginFace/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
+Dataset ID: ```wikisql``` from  [Huggingface/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/model_cards/mrm8488/t5-small-finetuned-quora-for-paraphrasing/README.md
+++ b/model_cards/mrm8488/t5-small-finetuned-quora-for-paraphrasing/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Question Paraphrasing) - Dataset 📚❓↔️❓
-Dataset ID: ```quora``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```quora``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/model_cards/mrm8488/t5-small-finetuned-squadv1/README.md
+++ b/model_cards/mrm8488/t5-small-finetuned-squadv1/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
-Dataset ID: ```squad``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/model_cards/mrm8488/t5-small-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/t5-small-finetuned-squadv2/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
-Dataset ID: ```squad_v2``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+Dataset ID: ```squad_v2``` from  [Huggingface/NLP](https://github.com/huggingface/nlp)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/model_cards/mrm8488/t5-small-finetuned-wikiSQL/README.md
+++ b/model_cards/mrm8488/t5-small-finetuned-wikiSQL/README.md
@@ -19,7 +19,7 @@ Transfer learning, where a model is first pre-trained on a data-rich task before
 ## Details of the Dataset 📚 
-Dataset ID: ```wikisql``` from  [HugginFace/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
+Dataset ID: ```wikisql``` from  [Huggingface/NLP](https://huggingface.co/nlp/viewer/?dataset=wikisql)
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
--- a/src/transformers/modeling_tf_pytorch_utils.py
+++ b/src/transformers/modeling_tf_pytorch_utils.py
@@ -39,7 +39,7 @@ def convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove="")
    return tuple with:
        - pytorch model weight name
-        - transpose: boolean indicating wether TF2.0 and PyTorch weights matrices are transposed with regards to each
+        - transpose: boolean indicating whether TF2.0 and PyTorch weight matrices are transposed with regard to each
          other
    """
    tf_name = tf_name.replace(":0", "")  # device ids
--- a/src/transformers/models/fsmt/modeling_fsmt.py
+++ b/src/transformers/models/fsmt/modeling_fsmt.py
@@ -951,7 +951,7 @@ class FSMTModel(PretrainedFSMTModel):
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
-        # If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOuput when return_dict=False
+        # If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOutput when return_dict=True
        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
            encoder_outputs = BaseModelOutput(
                last_hidden_state=encoder_outputs[0],
--- a/src/transformers/models/t5/modeling_tf_t5.py
+++ b/src/transformers/models/t5/modeling_tf_t5.py
@@ -642,7 +642,7 @@ class TFT5MainLayer(tf.keras.layers.Layer):
            raise ValueError(f"You have to specify either {err_msg_prefix}inputs or {err_msg_prefix}inputs_embeds")
        if inputs_embeds is None:
-            assert self.embed_tokens is not None, "You have to intialize the model with valid token embeddings"
+            assert self.embed_tokens is not None, "You have to initialize the model with valid token embeddings"
            inputs_embeds = self.embed_tokens(input_ids)
        batch_size, seq_length = input_shape
--- a/src/transformers/models/transfo_xl/modeling_transfo_xl.py
+++ b/src/transformers/models/transfo_xl/modeling_transfo_xl.py
@@ -667,9 +667,9 @@ class TransfoXLLMHeadModelOutput(ModelOutput):
    @property
    def logits(self):
-        # prediciton scores are the output of the adaptive softmax, see
+        # prediction scores are the output of the adaptive softmax, see
        # the file `modeling_transfo_xl_utilities`. Since the adaptive
-        # softmax returns the log softmax value, `self.prediciton_scores`
+        # softmax returns the log softmax value, `self.prediction_scores`
        # are strictly speaking not exactly `logits`, but behave the same
        # way logits do.
        return self.prediction_scores
@@ -886,7 +886,7 @@ class TransfoXLModel(TransfoXLPreTrainedModel):
                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)
            head_mask = head_mask.to(
                dtype=next(self.parameters()).dtype
-            )  # switch to fload if need + fp16 compatibility
+            )  # switch to float if needed + fp16 compatibility
        else:
            head_mask = [None] * self.n_layer
--- a/src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py
+++ b/src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py
@@ -91,8 +91,8 @@ class ProjectedAdaptiveLogSoftmax(nn.Module):
        Return:
            if labels is None: out :: [len*bsz x n_tokens] log probabilities of tokens over the vocabulary else: out ::
-            [(len-1)*bsz] Negative log likelihood We could replace this implementation by the native PyTorch one if
+            [(len-1)*bsz] Negative log likelihood. We could replace this implementation by the native PyTorch one if
-            their's had an option to set bias on all clusters in the native one. here:
+            theirs had an option to set bias on all clusters in the native one. here:
            https://github.com/pytorch/pytorch/blob/dbe6a7a9ff1a364a8706bf5df58a1ca96d2fd9da/torch/nn/modules/adaptive.py#L138
        """
--- a/src/transformers/models/xlm/modeling_tf_xlm.py
+++ b/src/transformers/models/xlm/modeling_tf_xlm.py
@@ -633,11 +633,11 @@ XLM_INPUTS_DOCSTRING = r"""
            A parallel sequence of tokens to be used to indicate the language of each token in the input. Indices are
            languages ids which can be obtained from the language names by using two conversion mappings provided in
            the configuration of the model (only provided for multilingual models). More precisely, the `language name
-            to language id` mapping is in :obj:`model.config.lang2id` (which is a dictionary strring to int) and the
+            to language id` mapping is in :obj:`model.config.lang2id` (which is a dictionary string to int) and the
            `language id to language name` mapping is in :obj:`model.config.id2lang` (dictionary int to string).
            See usage examples detailed in the :doc:`multilingual documentation <../multilingual>`.
-        ttoken_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
+        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
            Segment token indices to indicate first and second portions of the inputs. Indices are selected in ``[0,
            1]``:
--- a/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
+++ b/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
@@ -54,7 +54,7 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 class XLMRobertaTokenizer(PreTrainedTokenizer):
    """
-    Adapted from :class:`~transfomers.RobertaTokenizer` and class:`~transfomers.XLNetTokenizer`. Based on
+    Adapted from :class:`~transformers.RobertaTokenizer` and :class:`~transformers.XLNetTokenizer`. Based on
    `SentencePiece <https://github.com/google/sentencepiece>`__.
    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
--- a/src/transformers/models/xlnet/modeling_xlnet.py
+++ b/src/transformers/models/xlnet/modeling_xlnet.py
@@ -904,7 +904,7 @@ XLNET_INPUTS_DOCSTRING = r"""
            Mask values selected in ``[0, 1]``:
            - 1 for tokens that are **masked**,
-            - 0 for tokens that are **not maked**.
+            - 0 for tokens that are **not masked**.
            You can only uses one of :obj:`input_mask` and :obj:`attention_mask`.
        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
@@ -1211,7 +1211,7 @@ class XLNetModel(XLNetPreTrainedModel):
                head_mask = head_mask.unsqueeze(1).unsqueeze(1).unsqueeze(1)
            head_mask = head_mask.to(
                dtype=next(self.parameters()).dtype
-            )  # switch to fload if need + fp16 compatibility
+            )  # switch to float if needed + fp16 compatibility
        else:
            head_mask = [None] * self.n_layer
--- a/src/transformers/optimization_tf.py
+++ b/src/transformers/optimization_tf.py
@@ -167,9 +167,9 @@ class AdamWeightDecay(tf.keras.optimizers.Adam):
        beta_2 (:obj:`float`, `optional`, defaults to 0.999):
            The beta2 parameter in Adam, which is the exponential decay rate for the 2nd momentum estimates.
        epsilon (:obj:`float`, `optional`, defaults to 1e-7):
-            The epsilon paramenter in Adam, which is a small constant for numerical stability.
+            The epsilon parameter in Adam, which is a small constant for numerical stability.
        amsgrad (:obj:`bool`, `optional`, default to `False`):
-            Whether to apply AMSGrad varient of this algorithm or not, see `On the Convergence of Adam and Beyond
+            Whether to apply AMSGrad variant of this algorithm or not, see `On the Convergence of Adam and Beyond
            <https://arxiv.org/abs/1904.09237>`__.
        weight_decay_rate (:obj:`float`, `optional`, defaults to 0):
            The weight decay to apply.