Update quantization docs (#37439)

2025-04-16 09:44:53 -04:00
parent a91020aed0
commit 9ddcf5fce5
3 changed files with 202 additions and 0 deletions
--- a/docs/source/en/quantization/selecting.md
+++ b/docs/source/en/quantization/selecting.md
@@ -130,6 +130,28 @@ Methods like [AQLM](./aqlm), [SpQR](./spqr), [VPTQ](./vptq), [HIGGS](./higgs), e
    *   You have significant compute resources available for potentially complex quantization procedures.
 We recommend consulting each method's documentation and associated papers carefully before choosing one for use in production.

+## Benchmark Comparison
+
+To provide a quantitative comparison of different quantization methods, we benchmarked several popular techniques on the Llama 3.1 8B and 70B models. The following tables show results for accuracy (higher is better), inference throughput measured in tokens/second (higher is better), peak VRAM usage measured in GB (lower is better), and quantization time.
+
+Performance metrics were measured on 2 NVIDIA A100 80GB GPUs for Llama 3.1 70B (bfloat16), 1 NVIDIA H100 80GB GPU for FP8 methods, and 1 NVIDIA A100 80GB GPU for all other methods. Throughput was measured with a batch size of 1 and generating 64 tokens.
+Results for `torch.compile` and Marlin kernels are included where applicable and supported.
+
+<iframe
+  src="https://huggingface.co/datasets/derekl35/quantization-benchmarks/embed/viewer/default/train"
+  frameborder="0"
+  width="100%"
+  height="560px"
+  title="benchmarking results dataset"
+></iframe>
+
+The key takeaways are:
+
+| Quantization & Methods                          | Memory Savings (vs bf16) | Accuracy                             | Other Notes                                                                                        |
+|-------------------------------------------------|--------------------------|--------------------------------------|----------------------------------------------------------------------------------------------------|
+| **8-bit** (bnb-int8, HQQ, Quanto, torchao, fp8) | ~2x                      | Very close to baseline bf16 model    |                                                                                                    |
+| **4-bit** (AWQ, GPTQ, HQQ, bnb-nf4)             | ~4x                      | Relatively high accuracy             | AWQ/GPTQ often lead in accuracy but need calibration. HQQ/bnb-nf4 are easy to apply on the fly.    |
+| **Sub-4-bit** (VPTQ, AQLM, 2-bit GPTQ)          | Extreme (>4x)            | Noticeable drop, especially at 2-bit | Quantization times can be very long (AQLM, VPTQ). Performance varies.                              |

 > [!TIP]
 > Always benchmark the performance (accuracy and speed) of the quantized model on your specific task and hardware to ensure it meets your requirements. Refer to the individual documentation pages linked above for detailed usage instructions.
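
Reviewer note: the measurement setup described in the added section (batch size 1, 64 generated tokens) and the advice in the tip could be illustrated with a small timing harness. The sketch below is not the script used to produce the benchmark dataset. `measure_throughput` and `fake_generate` are hypothetical names introduced here, and the stand-in generator only sleeps so that the example runs without a GPU or a model.

```python
import time
from statistics import median
from typing import Callable


def measure_throughput(
    generate_fn: Callable[[int], int],
    new_tokens: int = 64,
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Return the median tokens/second over `repeats` timed calls.

    `generate_fn(new_tokens)` is assumed to run one generation and return
    the number of tokens it actually produced.
    """
    # Untimed warm-up calls absorb one-off costs such as kernel compilation.
    for _ in range(warmup):
        generate_fn(new_tokens)

    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate_fn(new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    return median(rates)


def fake_generate(new_tokens: int) -> int:
    # Hypothetical stand-in for a real model call: sleeps 1 ms per token.
    time.sleep(0.001 * new_tokens)
    return new_tokens


if __name__ == "__main__":
    print(f"{measure_throughput(fake_generate):.1f} tokens/s")
```

With a real model, `generate_fn` would presumably wrap a call such as `model.generate(**inputs, max_new_tokens=new_tokens)` followed by `torch.cuda.synchronize()`, so that asynchronous GPU work is included in the timing. Peak VRAM could be read with `torch.cuda.max_memory_allocated()` after calling `torch.cuda.reset_peak_memory_stats()`.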