From 37508816d650a8074bc31c761e20872c2e5eaec4 Mon Sep 17 00:00:00 2001
From: co63oc <co63oc@users.noreply.github.com>
Date: Tue, 4 Mar 2025 21:47:41 +0800
Subject: [PATCH] chore: Fix typos in docs and examples (#36524)

Fix typos in docs and examples

Signed-off-by: co63oc <co63oc@users.noreply.github.com>
---
 awesome-transformers.md                                   | 2 +-
 docs/source/en/add_new_model.md                           | 8 ++++----
 docs/source/en/agents.md                                  | 2 +-
 docs/source/en/deepspeed.md                               | 4 ++--
 docs/source/en/generation_features.md                     | 2 +-
 docs/source/en/llm_tutorial.md                            | 2 +-
 docs/source/en/model_doc/speech_to_text.md                | 2 +-
 docs/source/en/model_doc/tvp.md                           | 2 +-
 docs/source/en/modular_transformers.md                    | 4 ++--
 docs/source/en/perf_hardware.md                           | 2 +-
 docs/source/en/perf_train_gpu_many.md                     | 2 +-
 docs/source/en/pipeline_tutorial.md                       | 2 +-
 docs/source/en/testing.md                                 | 2 +-
 docs/source/zh/agents.md                                  | 2 +-
 .../run_flax_speech_recognition_seq2seq.py                | 2 +-
 .../run_wav2vec2_pretraining_no_trainer.py                | 4 ++--
 .../speech-recognition/run_speech_recognition_ctc.py      | 2 +-
 .../run_speech_recognition_ctc_adapter.py                 | 4 ++--
 .../pytorch/text-classification/run_classification.py     | 2 +-
 examples/pytorch/text-generation/README.md                | 2 +-
 examples/pytorch/token-classification/README.md           | 2 +-
 .../research_projects/bertabs/configuration_bertabs.py    | 4 ++--
 .../convert_bertabs_original_pytorch_checkpoint.py        | 6 +++---
 examples/research_projects/bertabs/modeling_bertabs.py    | 2 +-
 examples/research_projects/bertabs/run_summarization.py   | 4 ++--
 .../codeparrot/scripts/codeparrot_training.py             | 2 +-
 .../research_projects/codeparrot/scripts/preprocessing.py | 2 +-
 .../performer/modeling_flax_performer_utils.py            | 2 +-
 .../research_projects/rag-end2end-retriever/README.md     | 2 +-
 .../rag-end2end-retriever/lightning_base.py               | 4 ++--
 examples/research_projects/rag/README.md                  | 2 +-
 .../rag/distributed_pytorch_retriever.py                  | 2 +-
 examples/research_projects/rag/finetune_rag.py            | 2 +-
 .../robust-speech-event/run_speech_recognition_ctc_bnb.py | 2 +-
 .../run_speech_recognition_ctc_streaming.py               | 2 +-
 examples/research_projects/wav2vec2/run_asr.py            | 2 +-
 examples/research_projects/wav2vec2/run_common_voice.py   | 2 +-
 examples/research_projects/wav2vec2/run_pretrain.py       | 2 +-
 38 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/awesome-transformers.md b/awesome-transformers.md
index f9676f29b2..29f50184ec 100644
--- a/awesome-transformers.md
+++ b/awesome-transformers.md
@@ -47,7 +47,7 @@ Keywords: LLMs, Large Language Models, Agents, Chains
 
 ## [LlamaIndex](https://github.com/run-llama/llama_index)
 
-[LlamaIndex](https://github.com/run-llama/llama_index) is a project that provides a central interface to connect your LLM's with external data. It provides various kinds of indices and retreival mechanisms to perform different LLM tasks and obtain knowledge-augmented results.
+[LlamaIndex](https://github.com/run-llama/llama_index) is a project that provides a central interface to connect your LLMs with external data. It provides various kinds of indices and retrieval mechanisms to perform different LLM tasks and obtain knowledge-augmented results.
 
 Keywords: LLMs, Large Language Models, Data Retrieval, Indices, Knowledge Augmentation 
 
diff --git a/docs/source/en/add_new_model.md b/docs/source/en/add_new_model.md
index bfab511972..6f8e5499ba 100644
--- a/docs/source/en/add_new_model.md
+++ b/docs/source/en/add_new_model.md
@@ -57,7 +57,7 @@ There is never more than two levels of abstraction for any model to keep the cod
 
 Other important functions like the forward method are defined in the `modeling.py` file.
 
-Specific model heads (for example, sequence classification or language modeling) should call the base model in the forward pass rather than inherting from it to keep abstraction low.
+Specific model heads (for example, sequence classification or language modeling) should call the base model in the forward pass rather than inheriting from it to keep abstraction low.
 
 New models require a configuration, for example `BrandNewLlamaConfig`, that is stored as an attribute of [`PreTrainedModel`].
 
@@ -233,7 +233,7 @@ If you run into issues, you'll need to choose one of the following debugging str
 This strategy relies on breaking the original model into smaller sub-components, such as when the code can be easily run in eager mode. While more difficult, there are some advantages to this approach.
 
 1. It is easier later to compare the original model to your implementation. You can automatically verify that each individual component matches its corresponding component in the Transformers' implementation. This is better than relying on a visual comparison based on print statements.
-2. It is easier to port individal components instead of the entire model.
+2. It is easier to port individual components instead of the entire model.
 3. It is easier for understanding how a model works by breaking it up into smaller parts.
 4. It is easier to prevent regressions at a later stage when you change your code thanks to component-by-component tests.
 
@@ -328,7 +328,7 @@ def _init_weights(self, module):
 
 The initialization scheme can look different if you need to adapt it to your model. For example, [`Wav2Vec2ForPreTraining`] initializes [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) in its last two linear layers.
 
-The `_is_hf_initialized` flag makes sure the submodule is only initialized once. Setting `module.project_q` and `module.project_hid` to `True` ensures the custom initialization is not overriden later. The `_init_weights` function won't be applied to these modules.
+The `_is_hf_initialized` flag makes sure the submodule is only initialized once. Setting `module.project_q` and `module.project_hid` to `True` ensures the custom initialization is not overridden later. The `_init_weights` function won't be applied to these modules.
 
 ```py
 def _init_weights(self, module):
@@ -457,7 +457,7 @@ Don't be discouraged if your forward pass isn't identical with the output from t
 Your output should have a precision of *1e-3*. Ensure the output shapes and output values are identical. Common reasons for why the outputs aren't identical include:
 
 - Some layers were not added (activation layer or a residual connection).
-- The word embedding matix is not tied.
+- The word embedding matrix is not tied.
 - The wrong positional embeddings are used because the original implementation includes an offset.
 - Dropout is applied during the forward pass. Fix this error by making sure `model.training` is `False` and passing `self.training` to [torch.nn.functional.dropout](https://pytorch.org/docs/stable/nn.functional.html?highlight=dropout#torch.nn.functional.dropout).
 
diff --git a/docs/source/en/agents.md b/docs/source/en/agents.md
index 216a943e8d..bd24d8ce30 100644
--- a/docs/source/en/agents.md
+++ b/docs/source/en/agents.md
@@ -159,7 +159,7 @@ Here are a few examples using notional tools:
 ---
 {examples}
 
-Above example were using notional tools that might not exist for you. You only have acces to those tools:
+Above example were using notional tools that might not exist for you. You only have access to those tools:
 <<tool_names>>
 You also can perform computations in the python code you generate.
 
diff --git a/docs/source/en/deepspeed.md b/docs/source/en/deepspeed.md
index 4a84f93383..4d1df98e50 100644
--- a/docs/source/en/deepspeed.md
+++ b/docs/source/en/deepspeed.md
@@ -840,7 +840,7 @@ Unless you have a lot of free CPU memory, fp32 weights shouldn't be saved during
 <hfoptions id="save">
 <hfoption id="offline">
 
-DeepSpeed provies a [zero_to_fp32.py](https://github.com/microsoft/DeepSpeed/blob/91829476a8fd4d0d9268c03c1d56795d20a51c12/deepspeed/utils/zero_to_fp32.py#L14) script at the top-level checkpoint folder for extracting weights at any point. This is a standalone script and you don't need a config file or [`Trainer`].
+DeepSpeed provides a [zero_to_fp32.py](https://github.com/microsoft/DeepSpeed/blob/91829476a8fd4d0d9268c03c1d56795d20a51c12/deepspeed/utils/zero_to_fp32.py#L14) script at the top-level checkpoint folder for extracting weights at any point. This is a standalone script and you don't need a config file or [`Trainer`].
 
 For example, if your checkpoint folder looks like the one shown below, then you can run the following command to create and consolidate the fp32 weights from multiple GPUs into a single `pytorch_model.bin` file. The script automatically discovers the subfolder `global_step1` which contains the checkpoint.
 
@@ -942,7 +942,7 @@ import deepspeed
 ds_config = {...}
 # must run before instantiating the model to detect zero 3
 dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
-# randomly intialize model weights
+# randomly initialize model weights
 config = AutoConfig.from_pretrained("openai-community/gpt2")
 model = AutoModel.from_config(config)
 engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
diff --git a/docs/source/en/generation_features.md b/docs/source/en/generation_features.md
index 56163ba01d..19ac987807 100644
--- a/docs/source/en/generation_features.md
+++ b/docs/source/en/generation_features.md
@@ -50,7 +50,7 @@ The `streamer` parameter is compatible with any class with a [`~TextStreamer.put
 
 Watermarking is useful for detecting whether text is generated. The [watermarking strategy](https://hf.co/papers/2306.04634) in Transformers randomly "colors" a subset of the tokens green. When green tokens are generated, they have a small bias added to their logits, and a higher probability of being generated. You can detect generated text by comparing the proportion of green tokens to the amount of green tokens typically found in human-generated text.
 
-Watermarking is supported for any generative model in Transformers and doesn't require an extra classfication model to detect the watermarked text.
+Watermarking is supported for any generative model in Transformers and doesn't require an extra classification model to detect the watermarked text.
 
 Create a [`WatermarkingConfig`] with the bias value to add to the logits and watermarking algorithm. The example below uses the `"selfhash"` algorithm, where the green token selection only depends on the current token. Pass the [`WatermarkingConfig`] to [`~GenerationMixin.generate`].
 
diff --git a/docs/source/en/llm_tutorial.md b/docs/source/en/llm_tutorial.md
index e5c254debf..d867657202 100644
--- a/docs/source/en/llm_tutorial.md
+++ b/docs/source/en/llm_tutorial.md
@@ -87,7 +87,7 @@ You can customize [`~GenerationMixin.generate`] by overriding the parameters and
 model.generate(**inputs, num_beams=4, do_sample=True)
 ```
 
-[`~GenerationMixin.generate`] can also be extended with external libraries or custom code. The `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manupulating the next token probability distribution. `stopping_criteria` supports custom [`StoppingCriteria`] to stop text generation. Check out the [logits-processor-zoo](https://github.com/NVIDIA/logits-processor-zoo) for more examples of external [`~GenerationMixin.generate`]-compatible extensions.
+[`~GenerationMixin.generate`] can also be extended with external libraries or custom code. The `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution. `stopping_criteria` supports custom [`StoppingCriteria`] to stop text generation. Check out the [logits-processor-zoo](https://github.com/NVIDIA/logits-processor-zoo) for more examples of external [`~GenerationMixin.generate`]-compatible extensions.
 
 Refer to the [Generation strategies](./generation_strategies) guide to learn more about search, sampling, and decoding strategies.
 
diff --git a/docs/source/en/model_doc/speech_to_text.md b/docs/source/en/model_doc/speech_to_text.md
index 8b375374ea..bc65ea7965 100644
--- a/docs/source/en/model_doc/speech_to_text.md
+++ b/docs/source/en/model_doc/speech_to_text.md
@@ -74,7 +74,7 @@ be installed as follows: `apt install libsndfile1-dev`
   For multilingual speech translation models, `eos_token_id` is used as the `decoder_start_token_id` and
   the target language id is forced as the first generated token. To force the target language id as the first
   generated token, pass the `forced_bos_token_id` parameter to the `generate()` method. The following
-  example shows how to transate English speech to French text using the *facebook/s2t-medium-mustc-multilingual-st*
+  example shows how to translate English speech to French text using the *facebook/s2t-medium-mustc-multilingual-st*
   checkpoint.
 
 ```python
diff --git a/docs/source/en/model_doc/tvp.md b/docs/source/en/model_doc/tvp.md
index 33b31c8602..cadb6e71f0 100644
--- a/docs/source/en/model_doc/tvp.md
+++ b/docs/source/en/model_doc/tvp.md
@@ -111,7 +111,7 @@ def decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps
     Returns:
         frames (tensor): decoded frames from the video.
     '''
-    assert clip_idx >= -2, "Not a valied clip_idx {}".format(clip_idx)
+    assert clip_idx >= -2, "Not a valid clip_idx {}".format(clip_idx)
     frames, fps = pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps)
     clip_size = sampling_rate * num_frames / target_fps * fps
     index = np.linspace(0, clip_size - 1, num_frames)
diff --git a/docs/source/en/modular_transformers.md b/docs/source/en/modular_transformers.md
index 77080042c5..4d79f86148 100644
--- a/docs/source/en/modular_transformers.md
+++ b/docs/source/en/modular_transformers.md
@@ -355,7 +355,7 @@ class Olmo2Model(OlmoModel):
         )
 ```
 
-You only need to change the *type* of the `self.norm` attribute to use `RMSNorm` isntead of `LayerNorm`. This change doesn't affect the logic in the forward method (layer name and usage is identical to the parent class), so you don't need to overwrite it. The linter automatically unravels it.
+You only need to change the *type* of the `self.norm` attribute to use `RMSNorm` instead of `LayerNorm`. This change doesn't affect the logic in the forward method (the layer name and usage are identical to the parent class), so you don't need to overwrite it. The linter automatically unravels it.
 
 ### Model head
 
@@ -374,7 +374,7 @@ The logic is identical to `OlmoForCausalLM` which means you don't need to make a
 
 The [modeling_olmo2.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/olmo2/modeling_olmo2.py) generated by the linter also contains some classes (`Olmo2MLP`, `Olmo2RotaryEmbedding`, `Olmo2PreTrainedModel`) that weren't explicitly defined in `modular_olmo2.py`.
 
-Classes that are a dependency of an inherited class but aren't explicitly defined are automatically added as a part of depdendency tracing. This is similar to how some functions were added to the `Attention` class without drrectly importing them.
+Classes that are a dependency of an inherited class but aren't explicitly defined are automatically added as a part of dependency tracing. This is similar to how some functions were added to the `Attention` class without directly importing them.
 
 For example, `OlmoDecoderLayer` has an attribute defined as `self.mlp = OlmoMLP(config)`. This class was never explicitly redefined in `Olmo2MLP`, so the linter automatically created a `Olmo2MLP` class similar to `OlmoMLP`. It is identical to the code below if it was explicitly written in `modular_olmo2.py`.
 
diff --git a/docs/source/en/perf_hardware.md b/docs/source/en/perf_hardware.md
index 4827c40bed..49ba739be2 100644
--- a/docs/source/en/perf_hardware.md
+++ b/docs/source/en/perf_hardware.md
@@ -29,7 +29,7 @@ It is important the PSU has stable voltage otherwise it may not be able to suppl
 
 ## Cooling
 
-An overheated GPU throttles its performance and can even shutdown if it's too hot to prevent damage. Keeping the GPU temperature low, anywhere between 158 - 167F, is essential for delivering full perfomance and maintaining its lifespan. Once temperatures reach 183 - 194F, the GPU may begin to throttle performance.
+An overheated GPU throttles its performance and can even shut down to prevent damage if it gets too hot. Keeping the GPU temperature low, anywhere between 158 - 167F, is essential for delivering full performance and maintaining its lifespan. Once temperatures reach 183 - 194F, the GPU may begin to throttle performance.
 
 ## Multi-GPU connectivity
 
diff --git a/docs/source/en/perf_train_gpu_many.md b/docs/source/en/perf_train_gpu_many.md
index d710508e75..7dfd4cd63c 100644
--- a/docs/source/en/perf_train_gpu_many.md
+++ b/docs/source/en/perf_train_gpu_many.md
@@ -33,7 +33,7 @@ Use the [Model Memory Calculator](https://huggingface.co/spaces/hf-accelerate/mo
 
 ## Data parallelism
 
-Data parallelism evenly distributes data across multiple GPUs. Each GPU holds a copy of the model and concurrently proccesses their portion of the data. At the end, the results from each GPU are synchronized and combined.
+Data parallelism evenly distributes data across multiple GPUs. Each GPU holds a copy of the model and concurrently processes its portion of the data. At the end, the results from each GPU are synchronized and combined.
 
 Data parallelism significantly reduces training time by processing data in parallel, and it is scalable to the number of GPUs available. However, synchronizing results from each GPU can add overhead.
 
diff --git a/docs/source/en/pipeline_tutorial.md b/docs/source/en/pipeline_tutorial.md
index e6857ce297..24fa6275ac 100644
--- a/docs/source/en/pipeline_tutorial.md
+++ b/docs/source/en/pipeline_tutorial.md
@@ -24,7 +24,7 @@ Tailor the [`Pipeline`] to your task with task specific parameters such as addin
 
 Transformers has two pipeline classes, a generic [`Pipeline`] and many individual task-specific pipelines like [`TextGenerationPipeline`] or [`VisualQuestionAnsweringPipeline`]. Load these individual pipelines by setting the task identifier in the `task` parameter in [`Pipeline`]. You can find the task identifier for each pipeline in their API documentation.
 
-Each task is configured to use a default pretrained model and preprocessor, but this can be overriden with the `model` parameter if you want to use a different model.
+Each task is configured to use a default pretrained model and preprocessor, but this can be overridden with the `model` parameter if you want to use a different model.
 
 For example, to use the [`TextGenerationPipeline`] with [Gemma 2](./model_doc/gemma2), set `task="text-generation"` and `model="google/gemma-2-2b"`.
 
diff --git a/docs/source/en/testing.md b/docs/source/en/testing.md
index 9e85f2248e..dd0b9cbb42 100644
--- a/docs/source/en/testing.md
+++ b/docs/source/en/testing.md
@@ -220,7 +220,7 @@ Just run the following line to automatically test every docstring example in the
 ```bash
 pytest --doctest-modules <path_to_file_or_dir>
 ```
-If the file has a markdown extention, you should add the `--doctest-glob="*.md"` argument.
+If the file has a markdown extension, you should add the `--doctest-glob="*.md"` argument.
 
 ### Run only modified tests
 
diff --git a/docs/source/zh/agents.md b/docs/source/zh/agents.md
index 00fa74e654..b10fe43608 100644
--- a/docs/source/zh/agents.md
+++ b/docs/source/zh/agents.md
@@ -233,7 +233,7 @@ Here are a few examples using notional tools:
 ---
 {examples}
 
-Above example were using notional tools that might not exist for you. You only have acces to those tools:
+Above example were using notional tools that might not exist for you. You only have access to those tools:
 <<tool_names>>
 You also can perform computations in the python code you generate.
 
diff --git a/examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py b/examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py
index 81a7d49765..f3bcaa1562 100644
--- a/examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py
+++ b/examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py
@@ -265,7 +265,7 @@ class FlaxDataCollatorSpeechSeq2SeqWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor ([`Wav2Vec2Processor`])
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         decoder_start_token_id (:obj: `int`)
             The begin-of-sentence of the decoder.
         input_padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
diff --git a/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py b/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py
index 62b15c0f31..e64ef98189 100755
--- a/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py
+++ b/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py
@@ -296,7 +296,7 @@ class DataCollatorForWav2Vec2Pretraining:
             The Wav2Vec2 model used for pretraining. The data collator needs to have access
             to config and ``_get_feat_extract_output_lengths`` function for correct padding.
         feature_extractor (:class:`~transformers.Wav2Vec2FeatureExtractor`):
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
@@ -445,7 +445,7 @@ def main():
     accelerator.wait_for_everyone()
 
     # 1. Download and create train, validation dataset
-    # We load all dataset configuration and datset split pairs passed in
+    # We load all dataset configuration and dataset split pairs passed in
     # ``args.dataset_config_names`` and ``args.dataset_split_names``
     datasets_splits = []
     for dataset_config_name, train_split_name in zip(args.dataset_config_names, args.dataset_split_names):
diff --git a/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py b/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py
index 47352e22bc..f4a6922386 100755
--- a/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py
+++ b/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py
@@ -292,7 +292,7 @@ class DataCollatorCTCWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor (:class:`~transformers.AutoProcessor`)
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
diff --git a/examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py b/examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py
index 062fddffb1..1f18998e93 100755
--- a/examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py
+++ b/examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py
@@ -275,7 +275,7 @@ class DataCollatorCTCWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor (:class:`~transformers.AutoProcessor`)
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
@@ -559,7 +559,7 @@ def main():
                 )
 
                 # if we doing adapter language training, save
-                # vocab with adpter language
+                # vocab with adapter language
                 if data_args.target_language is not None:
                     vocab_dict[data_args.target_language] = lang_dict
 
diff --git a/examples/pytorch/text-classification/run_classification.py b/examples/pytorch/text-classification/run_classification.py
index cfee8ba50b..1af1d86913 100755
--- a/examples/pytorch/text-classification/run_classification.py
+++ b/examples/pytorch/text-classification/run_classification.py
@@ -429,7 +429,7 @@ def main():
     if is_regression:
         label_list = None
         num_labels = 1
-        # regession requires float as label type, let's cast it if needed
+        # regression requires float as label type, let's cast it if needed
         for split in raw_datasets.keys():
             if raw_datasets[split].features["label"].dtype not in ["float32", "float64"]:
                 logger.warning(
diff --git a/examples/pytorch/text-generation/README.md b/examples/pytorch/text-generation/README.md
index 72fc25e13c..b96bcd9224 100644
--- a/examples/pytorch/text-generation/README.md
+++ b/examples/pytorch/text-generation/README.md
@@ -19,7 +19,7 @@ limitations under the License.
 Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-generation/run_generation.py).
 
 Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, GPT-J, Transformer-XL, XLNet, CTRL, BLOOM, LLAMA, OPT.
-A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
+A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
 can try out the different models available in the library.
 
 Example usage:
diff --git a/examples/pytorch/token-classification/README.md b/examples/pytorch/token-classification/README.md
index b880b82030..734a1a1d1a 100644
--- a/examples/pytorch/token-classification/README.md
+++ b/examples/pytorch/token-classification/README.md
@@ -19,7 +19,7 @@ limitations under the License.
 ## PyTorch version
 
 Fine-tuning the library models for token classification task such as Named Entity Recognition (NER), Parts-of-speech
-tagging (POS) or phrase extraction (CHUNKS). The main scrip `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily
+tagging (POS) or phrase extraction (CHUNKS). The main script `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily
 customize it to your needs if you need extra processing on your datasets.
 
 It will either run on a datasets hosted on our [hub](https://huggingface.co/datasets) or with your own text files for
diff --git a/examples/research_projects/bertabs/configuration_bertabs.py b/examples/research_projects/bertabs/configuration_bertabs.py
index 4c65cd3395..3e7222d490 100644
--- a/examples/research_projects/bertabs/configuration_bertabs.py
+++ b/examples/research_projects/bertabs/configuration_bertabs.py
@@ -37,7 +37,7 @@ class BertAbsConfig(PretrainedConfig):
         max_pos: int
             The maximum sequence length that this model will be used with.
         enc_layer: int
-            The numner of hidden layers in the Transformer encoder.
+            The number of hidden layers in the Transformer encoder.
         enc_hidden_size: int
             The size of the encoder's layers.
         enc_heads: int
@@ -49,7 +49,7 @@ class BertAbsConfig(PretrainedConfig):
             embeddings, layers, pooler and also the attention probabilities in
             the encoder.
         dec_layer: int
-            The numner of hidden layers in the decoder.
+            The number of hidden layers in the decoder.
         dec_hidden_size: int
             The size of the decoder's layers.
         dec_heads: int
diff --git a/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py b/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py
index 338ffd21c9..f6222d35d4 100644
--- a/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py
+++ b/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py
@@ -130,7 +130,7 @@ def convert_bertabs_checkpoints(path_to_checkpoints, dump_path):
     mask_tgt = decoder_attention_mask = None
     mask_cls = None
 
-    # The original model does not apply the geneator layer immediatly but rather in
+    # The original model does not apply the generator layer immediately but rather in
     # the beam search (where it combines softmax + linear layer). Since we already
     # apply the softmax in our generation process we only apply the linear layer here.
     # We make sure that the outputs of the full stack are identical
@@ -143,9 +143,9 @@ def convert_bertabs_checkpoints(path_to_checkpoints, dump_path):
     output_converted_generator = new_model.generator(output_converted_model)
 
     maximum_absolute_difference = torch.max(torch.abs(output_converted_model - output_original_model)).item()
-    print("Maximum absolute difference beween weights: {:.2f}".format(maximum_absolute_difference))
+    print("Maximum absolute difference between weights: {:.2f}".format(maximum_absolute_difference))
     maximum_absolute_difference = torch.max(torch.abs(output_converted_generator - output_original_generator)).item()
-    print("Maximum absolute difference beween weights: {:.2f}".format(maximum_absolute_difference))
+    print("Maximum absolute difference between weights: {:.2f}".format(maximum_absolute_difference))
 
     are_identical = torch.allclose(output_converted_model, output_original_model, atol=1e-3)
     if are_identical:
diff --git a/examples/research_projects/bertabs/modeling_bertabs.py b/examples/research_projects/bertabs/modeling_bertabs.py
index c2c6a54be7..d65a0ca59d 100644
--- a/examples/research_projects/bertabs/modeling_bertabs.py
+++ b/examples/research_projects/bertabs/modeling_bertabs.py
@@ -390,7 +390,7 @@ class MultiHeadedAttention(nn.Module):
     :cite:`DBLP:journals/corr/VaswaniSPUJGKP17`.
 
     Similar to standard `dot` attention but uses
-    multiple attention distributions simulataneously
+    multiple attention distributions simultaneously
     to select relevant items.
 
     .. mermaid::
diff --git a/examples/research_projects/bertabs/run_summarization.py b/examples/research_projects/bertabs/run_summarization.py
index 1f969f117b..bc13de5589 100644
--- a/examples/research_projects/bertabs/run_summarization.py
+++ b/examples/research_projects/bertabs/run_summarization.py
@@ -260,7 +260,7 @@ def main():
         default=None,
         type=str,
         required=False,
-        help="The folder in wich the summaries should be written. Defaults to the folder where the documents are",
+        help="The folder in which the summaries should be written. Defaults to the folder where the documents are",
     )
     parser.add_argument(
         "--compute_rouge",
@@ -315,7 +315,7 @@ def main():
     )
     args = parser.parse_args()
 
-    # Select device (distibuted not available)
+    # Select device (distributed not available)
     args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
 
     # Check the existence of directories
diff --git a/examples/research_projects/codeparrot/scripts/codeparrot_training.py b/examples/research_projects/codeparrot/scripts/codeparrot_training.py
index 16f6077f24..549627d6ca 100644
--- a/examples/research_projects/codeparrot/scripts/codeparrot_training.py
+++ b/examples/research_projects/codeparrot/scripts/codeparrot_training.py
@@ -24,7 +24,7 @@ class ConstantLengthDataset(IterableDataset):
     """
     Iterable dataset that returns constant length chunks of tokens from stream of text files.
         Args:
-            tokenizer (Tokenizer): The processor used for proccessing the data.
+            tokenizer (Tokenizer): The processor used for processing the data.
             dataset (dataset.Dataset): Dataset with text files.
             infinite (bool): If True the iterator is reset after dataset reaches end else stops.
             seq_length (int): Length of token sequences to return.
diff --git a/examples/research_projects/codeparrot/scripts/preprocessing.py b/examples/research_projects/codeparrot/scripts/preprocessing.py
index d9cac5abfd..3e932c8ef6 100644
--- a/examples/research_projects/codeparrot/scripts/preprocessing.py
+++ b/examples/research_projects/codeparrot/scripts/preprocessing.py
@@ -84,7 +84,7 @@ def is_config_or_test(example, scan_width=5, coeff=0.05):
 
 
 def has_no_keywords(example):
-    """Check if a python file has none of the keywords for: funcion, class, for loop, while loop."""
+    """Check if a python file has none of the keywords for: function, class, for loop, while loop."""
     keywords = ["def ", "class ", "for ", "while "]
     lines = example["content"].splitlines()
     for line in lines:
diff --git a/examples/research_projects/performer/modeling_flax_performer_utils.py b/examples/research_projects/performer/modeling_flax_performer_utils.py
index 24c5e4d7c7..c524250938 100644
--- a/examples/research_projects/performer/modeling_flax_performer_utils.py
+++ b/examples/research_projects/performer/modeling_flax_performer_utils.py
@@ -252,7 +252,7 @@ def make_fast_generalized_attention(
     unidirectional=False,
     lax_scan_unroll=1,
 ):
-    """Construct a fast generalized attention menthod."""
+    """Construct a fast generalized attention method."""
     logging.info("Fast generalized attention.: %s features and renormalize=%s", nb_features, renormalize_attention)
     if features_type == "ortho":
         matrix_creator = functools.partial(GaussianOrthogonalRandomMatrix, nb_features, qkv_dim, scaling=False)
diff --git a/examples/research_projects/rag-end2end-retriever/README.md b/examples/research_projects/rag-end2end-retriever/README.md
index 9bff4e8c29..9aa0bc5dbc 100644
--- a/examples/research_projects/rag-end2end-retriever/README.md
+++ b/examples/research_projects/rag-end2end-retriever/README.md
@@ -11,7 +11,7 @@ Please read the [accompanying blog post](https://shamanesiri.medium.com/how-to-f
 The original RAG code has also been modified to work with the latest versions of pytorch lightning (version 1.2.10) and RAY (version 1.3.0). All other implementation details remain the same as the [original RAG code](https://github.com/huggingface/transformers/tree/main/examples/research_projects/rag).
 Read more about RAG  at https://arxiv.org/abs/2005.11401.
 
-This code can be modified to experiment with other research on retrival augmented models which include training of the retriever (e.g. [REALM](https://arxiv.org/abs/2002.08909) and [MARGE](https://arxiv.org/abs/2006.15020)).
+This code can be modified to experiment with other research on retrieval augmented models which include training of the retriever (e.g. [REALM](https://arxiv.org/abs/2002.08909) and [MARGE](https://arxiv.org/abs/2006.15020)).
 
 To start training, use the bash script (finetune_rag_ray_end2end.sh) in this folder. This script also includes descriptions on each command-line argument used.
 
diff --git a/examples/research_projects/rag-end2end-retriever/lightning_base.py b/examples/research_projects/rag-end2end-retriever/lightning_base.py
index 9c918eea47..c1a271e88d 100644
--- a/examples/research_projects/rag-end2end-retriever/lightning_base.py
+++ b/examples/research_projects/rag-end2end-retriever/lightning_base.py
@@ -134,7 +134,7 @@ class BaseTransformer(pl.LightningModule):
             {
                 "params": [
                     p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)
-                ],  # check this named paramters
+                ],  # check this named parameters
                 "weight_decay": self.hparams.weight_decay,
             },
             {
@@ -279,7 +279,7 @@ class InitCallback(pl.Callback):
 
 
 class CheckParamCallback(pl.Callback):
-    # check whether new added model paramters are differentiable
+    # check whether new added model parameters are differentiable
     def on_after_backward(self, trainer, pl_module):
         # print(pl_module.model.rag)
         for name, param in pl_module.model.rag.named_parameters():
diff --git a/examples/research_projects/rag/README.md b/examples/research_projects/rag/README.md
index 7fbaea84b9..59aa46a895 100644
--- a/examples/research_projects/rag/README.md
+++ b/examples/research_projects/rag/README.md
@@ -98,7 +98,7 @@ Our evaluation script enables two modes of evaluation (controlled by the `eval_m
 
 The evaluation script expects paths to two files:
 - `evaluation_set` - a path to a file specifying the evaluation dataset, a single input per line.
-- `gold_data_path` - a path to a file contaning ground truth answers for datapoints from the `evaluation_set`, a single output per line. Check below for expected formats of the gold data files.
+- `gold_data_path` - a path to a file containing ground truth answers for datapoints from the `evaluation_set`, a single output per line. Check below for expected formats of the gold data files.
 
 
 ## Retrieval evaluation
diff --git a/examples/research_projects/rag/distributed_pytorch_retriever.py b/examples/research_projects/rag/distributed_pytorch_retriever.py
index e2403ff8e5..b8c4b6fc3c 100644
--- a/examples/research_projects/rag/distributed_pytorch_retriever.py
+++ b/examples/research_projects/rag/distributed_pytorch_retriever.py
@@ -70,7 +70,7 @@ class RagPyTorchDistributedRetriever(RagRetriever):
             logger.info("dist not initialized / main")
             self.index.init_index()
 
-        # all processes wait untill the retriever is initialized by the main process
+        # all processes wait until the retriever is initialized by the main process
         if dist.is_initialized():
             torch.distributed.barrier(group=self.process_group)
 
diff --git a/examples/research_projects/rag/finetune_rag.py b/examples/research_projects/rag/finetune_rag.py
index 7f4778d7d7..af3acd4def 100644
--- a/examples/research_projects/rag/finetune_rag.py
+++ b/examples/research_projects/rag/finetune_rag.py
@@ -458,7 +458,7 @@ class GenerativeQAModule(BaseTransformer):
             default=None,
             help=(
                 "Name of the index to use: 'hf' for a canonical dataset from the datasets library (default), 'custom'"
-                " for a local index, or 'legacy' for the orignal one)"
+                " for a local index, or 'legacy' for the original one)"
             ),
         )
         parser.add_argument(
diff --git a/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py b/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py
index e95bc185e4..cb489ea28d 100755
--- a/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py
+++ b/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py
@@ -266,7 +266,7 @@ class DataCollatorCTCWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor (:class:`~transformers.AutoProcessor`)
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
diff --git a/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_streaming.py b/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_streaming.py
index 0fb567aba0..37f91b9ef6 100644
--- a/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_streaming.py
+++ b/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_streaming.py
@@ -257,7 +257,7 @@ class DataCollatorCTCWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor (:class:`~transformers.AutoProcessor`)
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
diff --git a/examples/research_projects/wav2vec2/run_asr.py b/examples/research_projects/wav2vec2/run_asr.py
index 6535e3485d..796d271583 100755
--- a/examples/research_projects/wav2vec2/run_asr.py
+++ b/examples/research_projects/wav2vec2/run_asr.py
@@ -226,7 +226,7 @@ class DataCollatorCTCWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor (:class:`~transformers.Wav2Vec2Processor`)
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
diff --git a/examples/research_projects/wav2vec2/run_common_voice.py b/examples/research_projects/wav2vec2/run_common_voice.py
index a7f57960d8..09a8458ca2 100644
--- a/examples/research_projects/wav2vec2/run_common_voice.py
+++ b/examples/research_projects/wav2vec2/run_common_voice.py
@@ -145,7 +145,7 @@ class DataCollatorCTCWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor (:class:`~transformers.Wav2Vec2Processor`)
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
diff --git a/examples/research_projects/wav2vec2/run_pretrain.py b/examples/research_projects/wav2vec2/run_pretrain.py
index 985e6df40e..00ef4edb37 100755
--- a/examples/research_projects/wav2vec2/run_pretrain.py
+++ b/examples/research_projects/wav2vec2/run_pretrain.py
@@ -142,7 +142,7 @@ class DataCollatorForWav2Vec2Pretraining:
             The Wav2Vec2 model used for pretraining. The data collator needs to have access
             to config and ``_get_feat_extract_output_lengths`` function for correct padding.
         feature_extractor (:class:`~transformers.Wav2Vec2FeatureExtractor`):
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among: