From e5ca9b057cad40d3c47595f7d2c1747c76a83542 Mon Sep 17 00:00:00 2001
From: omahs <73983677+omahs@users.noreply.github.com>
Date: Mon, 8 Jul 2024 12:52:47 +0200
Subject: [PATCH] Fix typos (#31819)

* fix typo
* fix typo
* fix typos
* fix typo
* fix typos
---
 docs/source/en/deepspeed.md                 | 12 ++++++------
 docs/source/en/glossary.md                  |  4 ++--
 docs/source/en/llm_tutorial_optimization.md |  4 ++--
 docs/source/en/perf_hardware.md             |  2 +-
 utils/diff_model_converter.py               |  2 +-
 5 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/docs/source/en/deepspeed.md b/docs/source/en/deepspeed.md
index 868021a9cd..7f7995c466 100644
--- a/docs/source/en/deepspeed.md
+++ b/docs/source/en/deepspeed.md
@@ -16,11 +16,11 @@ rendered properly in your Markdown viewer.

 # DeepSpeed

-[DeepSpeed](https://www.deepspeed.ai/) is a PyTorch optimization library that makes distributed training memory-efficient and fast. At it's core is the [Zero Redundancy Optimizer (ZeRO)](https://hf.co/papers/1910.02054) which enables training large models at scale. ZeRO works in several stages:
+[DeepSpeed](https://www.deepspeed.ai/) is a PyTorch optimization library that makes distributed training memory-efficient and fast. At its core is the [Zero Redundancy Optimizer (ZeRO)](https://hf.co/papers/1910.02054) which enables training large models at scale. ZeRO works in several stages:

-* ZeRO-1, optimizer state partioning across GPUs
+* ZeRO-1, optimizer state partitioning across GPUs
 * ZeRO-2, gradient partitioning across GPUs
-* ZeRO-3, parameteter partitioning across GPUs
+* ZeRO-3, parameter partitioning across GPUs

 In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. DeepSpeed is integrated with the Transformers [`Trainer`] class for all ZeRO stages and offloading. All you need to do is provide a config file or you can use a provided template. For inference, Transformers support ZeRO-3 and offloading since it allows loading huge models.

@@ -159,7 +159,7 @@ There are three types of configuration parameters:

 You could also modify the DeepSpeed configuration and edit [`TrainingArguments`] from it:

-1. Create or load a DeepSpeed configuration to used as the main configuration
+1. Create or load a DeepSpeed configuration to use as the main configuration
 2. Create a [`TrainingArguments`] object based on these DeepSpeed configuration values

 Some values, such as `scheduler.params.total_num_steps` are calculated by the [`Trainer`] during training.
@@ -191,7 +191,7 @@ ZeRO-1 shards the optimizer states across GPUs, and you can expect a tiny speed

-ZeRO-2 shards the optimizer and gradients across GPUs. This stage is primarily used for training since it's features are not relevant to inference. Some important parameters to configure for better performance include:
+ZeRO-2 shards the optimizer and gradients across GPUs. This stage is primarily used for training since its features are not relevant to inference. Some important parameters to configure for better performance include:

 * `offload_optimizer` should be enabled to reduce GPU memory usage.
 * `overlap_comm` when set to `true` trades off increased GPU memory usage to lower allreduce latency. This feature uses 4.5x the `allgather_bucket_size` and `reduce_bucket_size` values. In this example, they're set to `5e8` which means it requires 9GB of GPU memory. If your GPU memory is 8GB or less, you should reduce `overlap_comm` to lower the memory requirements and prevent an out-of-memory (OOM) error.
@@ -226,7 +226,7 @@ ZeRO-3 shards the optimizer, gradient, and parameters across GPUs. Unlike ZeRO-2
 * `pin_memory: true` can improve throughput, but less memory becomes available for other processes because the pinned memory is reserved for the specific process that requested it and it's typically accessed much faster than normal CPU memory.
 * `stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given time. Reduce this value if you encounter an OOM error.
 * `stage3_max_reuse_distance` is a value for determining when a parameter is used again in the future, and it helps decide whether to throw the parameter away or to keep it. If the parameter is going to be reused (if the value is less than `stage3_max_reuse_distance`), then it is kept to reduce communication overhead. This is super helpful when activation checkpointing is enabled and you want to keep the parameter in the forward recompute until the backward pass. But reduce this value if you encounter an OOM error.
-* `stage3_gather_16bit_weights_on_model_save` consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is an expensive in terms of memory and speed. You should enable it if you're planning on resuming training.
+* `stage3_gather_16bit_weights_on_model_save` consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is expensive in terms of memory and speed. You should enable it if you're planning on resuming training.
 * `sub_group_size` controls which parameters are updated during the optimizer step. Parameters are grouped into buckets of `sub_group_size` and each bucket is updated one at a time. When used with NVMe offload, `sub_group_size` determines when model states are moved in and out of CPU memory from during the optimization step. This prevents running out of CPU memory for extremely large models. `sub_group_size` can be left to its default value if you aren't using NVMe offload, but you may want to change it if you:

     1. Run into an OOM error during the optimizer step. In this case, reduce `sub_group_size` to reduce memory usage of the temporary buffers.
diff --git a/docs/source/en/glossary.md b/docs/source/en/glossary.md
index f3c2c50d70..d9fdac2475 100644
--- a/docs/source/en/glossary.md
+++ b/docs/source/en/glossary.md
@@ -139,7 +139,7 @@ reading the whole sentence with a mask to hide future tokens at a certain timest

 ### deep learning (DL)

-Machine learning algorithms which uses neural networks with several layers.
+Machine learning algorithms which use neural networks with several layers.

 ## E

@@ -519,4 +519,4 @@ A form of model training in which data provided to the model is not labeled. Uns
 Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensor-parallelism-tp), except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
-Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).
\ No newline at end of file
+Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).
diff --git a/docs/source/en/llm_tutorial_optimization.md b/docs/source/en/llm_tutorial_optimization.md
index 93848d72b0..23086929f6 100644
--- a/docs/source/en/llm_tutorial_optimization.md
+++ b/docs/source/en/llm_tutorial_optimization.md
@@ -147,7 +147,7 @@ Let's call it now for the next experiment.
 ```python
 flush()
 ```
-In the recent version of the accelerate library, you can also use an utility method called `release_memory()`
+In the recent version of the accelerate library, you can also use a utility method called `release_memory()`
 ```python
 from accelerate.utils import release_memory
@@ -683,7 +683,7 @@ Assistant: Germany has ca. 81 million inhabitants

 In this chat, the LLM runs auto-regressive decoding twice:
   1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"` and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
-  2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, it's computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
+  2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.

 Two things should be noted here:
   1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
diff --git a/docs/source/en/perf_hardware.md b/docs/source/en/perf_hardware.md
index c42b58483b..eb09ab439b 100644
--- a/docs/source/en/perf_hardware.md
+++ b/docs/source/en/perf_hardware.md
@@ -116,7 +116,7 @@ Each new generation provides a faster bandwidth, e.g. here is a quote from [Nvid

 So the higher `X` you get in the report of `NVX` in the output of `nvidia-smi topo -m` the better. The generation will depend on your GPU architecture.

-Let's compare the execution of a openai-community/gpt2 language model training over a small sample of wikitext.
+Let's compare the execution of an openai-community/gpt2 language model training over a small sample of wikitext.

 The results are:

diff --git a/utils/diff_model_converter.py b/utils/diff_model_converter.py
index e86c6405d4..f05c57581c 100644
--- a/utils/diff_model_converter.py
+++ b/utils/diff_model_converter.py
@@ -497,7 +497,7 @@ class DiffConverterTransformer(CSTTransformer):
                         start_insert_idx -= 1
                         self.new_body[dependency] = {"insert_idx": start_insert_idx, "node": node}
                     elif dependency not in self.inserted_deps:
-                        # make sure the node is written after it's dependencies
+                        # make sure the node is written after its dependencies
                         start_insert_idx = self.new_body[dependency]["insert_idx"] - 1
                     self.inserted_deps.append(dependency)
             if len(list_dependencies) > 0: