From a8531f3bfd3c1dd63392c8c0c470c6744c8969ae Mon Sep 17 00:00:00 2001
From: titi <66329321+titi-devv@users.noreply.github.com>
Date: Tue, 26 Sep 2023 10:11:28 +0200
Subject: [PATCH] Deleted duplicate sentence (#26394)

---
 docs/source/en/perf_infer_gpu_one.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/docs/source/en/perf_infer_gpu_one.md b/docs/source/en/perf_infer_gpu_one.md
index 86e137cf14..f0c0bf0b10 100644
--- a/docs/source/en/perf_infer_gpu_one.md
+++ b/docs/source/en/perf_infer_gpu_one.md
@@ -68,8 +68,6 @@ You can benefit from considerable speedups for fine-tuning and inference, especi
 To overcome this, one should use Flash Attention without padding tokens in the sequence for training (e.g., by packing a dataset, i.e., concatenating sequences until reaching the maximum sequence length. An example is provided [here](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py#L516).
 
-Below is the expected speedup you can get for a simple forward pass on [tiiuae/falcon-7b](https://hf.co/tiiuae/falcon-7b) with a sequence length of 4096 and various batch sizes without padding tokens:
-
 Below is the expected speedup you can get for a simple forward pass on [tiiuae/falcon-7b](https://hf.co/tiiuae/falcon-7b) with a sequence length of 4096 and various batch sizes, without padding tokens: