From 90f256c90c927a804847bb825b145b282d81b495 Mon Sep 17 00:00:00 2001 From: Martin Date: Sun, 29 Dec 2024 21:57:08 +0800 Subject: [PATCH] Update perf_infer_gpu_one.md: fix a typo (#35441) --- docs/source/en/perf_infer_gpu_one.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/perf_infer_gpu_one.md b/docs/source/en/perf_infer_gpu_one.md index d794509641..1b7b2a39b6 100644 --- a/docs/source/en/perf_infer_gpu_one.md +++ b/docs/source/en/perf_infer_gpu_one.md @@ -462,7 +462,7 @@ generated_ids = model.generate(**inputs) outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) ``` -To load a model in 4-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU: +To load a model in 8-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU: ```py max_memory_mapping = {0: "1GB", 1: "2GB"}