From de182ba2690fe6c3466f6463c7f4b3a61694b885 Mon Sep 17 00:00:00 2001
From: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Date: Tue, 22 Apr 2025 16:13:25 +0200
Subject: [PATCH] Refactor bitsandbytes doc (#37668)

* doc

* torch ops

* fix

* nits

* Update docs/source/en/quantization/bitsandbytes.md

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
---
 docs/source/en/quantization/bitsandbytes.md | 68 ++++++++++++++++++---
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/docs/source/en/quantization/bitsandbytes.md b/docs/source/en/quantization/bitsandbytes.md
index e9b581b89f..3ffdd89f2f 100644
--- a/docs/source/en/quantization/bitsandbytes.md
+++ b/docs/source/en/quantization/bitsandbytes.md
@@ -14,13 +14,21 @@ rendered properly in your Markdown viewer.

 -->

-# bitsandbytes
+# Bitsandbytes

-[bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) features the LLM.int8 and QLoRA quantization to enable accessible large language model inference and training.
+The [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library provides quantization tools for LLMs through a lightweight Python wrapper around CUDA functions. It enables working with large models using limited computational resources by reducing their memory footprint.

-[LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that aims to make large language model inference more accessible without significant degradation. Unlike naive 8-bit quantization, which can result in loss of critical information and accuracy, LLM.int8() dynamically adapts to ensure sensitive components of the computation retain higher precision when needed.
+At its core, bitsandbytes provides:

-QLoRA, or 4-bit quantization, compresses a model even further to 4-bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allowing training.
+- **Quantized Linear Layers**: `Linear8bitLt` and `Linear4bit` layers that replace standard PyTorch linear layers with memory-efficient quantized alternatives
+- **Optimized Optimizers**: 8-bit versions of common optimizers through its `optim` module, enabling training of large models with reduced memory requirements
+- **Matrix Multiplication**: Optimized matrix multiplication operations that leverage the quantized format
+
+bitsandbytes offers two main quantization features:
+
+1. **LLM.int8()** - An 8-bit quantization method that makes inference more accessible without significant performance degradation. Unlike naive quantization, [LLM.int8()](https://hf.co/papers/2208.07339) dynamically preserves higher precision for critical computations, preventing information loss in sensitive parts of the model.
+
+2. **QLoRA** - A 4-bit quantization technique that compresses models even further while maintaining trainability by inserting a small set of trainable low-rank adaptation (LoRA) weights.

 > **Note:** For a user-friendly quantization experience, you can use the `bitsandbytes` [community space](https://huggingface.co/spaces/bnb-community/bnb-my-repo).

@@ -30,12 +38,38 @@ Run the command below to install bitsandbytes.
 ```bash
 pip install --upgrade transformers accelerate bitsandbytes
 ```
+To compile from source, follow the instructions in the [bitsandbytes installation guide](https://huggingface.co/docs/bitsandbytes/main/en/installation).
+
+## Hardware Compatibility
+bitsandbytes is currently only supported on CUDA GPUs for CUDA versions 11.0 - 12.8. However, there's an ongoing multi-backend effort under development, which is currently in alpha. If you're interested in providing feedback or testing, check out the [bitsandbytes repository](https://github.com/bitsandbytes-foundation/bitsandbytes) for more information.
+
+ ### CUDA
+
+ | Feature | Minimum Hardware Requirement |
+ |---------|-------------------------------|
+ | 8-bit optimizers | NVIDIA Maxwell (GTX 900 series, TITAN X, M40) or newer GPUs * |
+ | LLM.int8() | NVIDIA Turing (RTX 20 series, T4) or newer GPUs |
+ | NF4/FP4 quantization | NVIDIA Maxwell (GTX 900 series, TITAN X, M40) or newer GPUs * |
+
+### Multi-backend
+
+| Backend | Supported Versions | Python versions | Architecture Support | Status |
+|---------|-------------------|----------------|---------------------|---------|
+| AMD ROCm | 6.1+ | 3.10+ | minimum CDNA - gfx90a, RDNA - gfx1100 | Alpha |
+| Apple Silicon (MPS) | WIP | 3.10+ | M1/M2 chips | Planned |
+| Intel CPU | v2.4.0+ (ipex) | 3.10+ | Intel CPU | Alpha |
+| Intel GPU | v2.4.0+ (ipex) | 3.10+ | Intel GPU | Experimental |
+| Ascend NPU | 2.1.0+ (torch_npu) | 3.10+ | Ascend NPU | Experimental |
+
+> **Note:** Bitsandbytes is moving away from the multi-backend approach towards using [Pytorch Custom Operators](https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html), as the main mechanism for supporting new hardware, and dispatching to the correct backend.
+
+## Quantization Examples

 Quantize a model by passing a [`BitsAndBytesConfig`] to [`~PreTrainedModel.from_pretrained`]. This works for any model in any modality, as long as it supports [Accelerate](https://huggingface.co/docs/accelerate/index) and contains [torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers.
-
+
 Quantizing a model in 8-bit halves the memory-usage, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.

 ```py
@@ -45,6 +79,7 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)

 model_8bit = AutoModelForCausalLM.from_pretrained(
     "bigscience/bloom-1b7",
+    device_map="auto",
     quantization_config=quantization_config
 )
 ```
@@ -59,6 +94,7 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)

 model_8bit = AutoModelForCausalLM.from_pretrained(
     "facebook/opt-350m",
+    device_map="auto",
     quantization_config=quantization_config,
     torch_dtype="auto"
 )
@@ -74,16 +110,16 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)

 model = AutoModelForCausalLM.from_pretrained(
     "bigscience/bloom-560m",
+    device_map="auto",
     quantization_config=quantization_config
 )
-tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

 model.push_to_hub("bloom-560m-8bit")
 ```
-
+
-
+
 Quantizing a model in 4-bit reduces your memory-usage by 4x, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.

 ```py
@@ -93,6 +129,7 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True)

 model_4bit = AutoModelForCausalLM.from_pretrained(
     "bigscience/bloom-1b7",
+    device_map="auto",
     quantization_config=quantization_config
 )
 ```
@@ -107,6 +144,7 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True)

 model_4bit = AutoModelForCausalLM.from_pretrained(
     "facebook/opt-350m",
+    device_map="auto",
     quantization_config=quantization_config,
     torch_dtype="auto"
 )
@@ -115,6 +153,20 @@ model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype

 Make sure you have the latest bitsandbytes version so you can serialize 4-bit models and push them to the Hub with [`~PreTrainedModel.push_to_hub`]. Use [`~PreTrainedModel.save_pretrained`] to save the 4-bit model locally.

+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "bigscience/bloom-560m",
+    device_map="auto",
+    quantization_config=quantization_config
+)
+
+model.push_to_hub("bloom-560m-4bit")
+```
+
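
The patched doc says 8-bit quantization halves memory usage and 4-bit reduces it by 4x. Those ratios are presumably relative to 16-bit weights, which the doc does not state outright. A minimal sketch of that arithmetic follows, counting weight storage only and ignoring activations and the small per-block quantization constants. The `weight_memory_gib` helper and the parameter count (rounded from the "1b7" in the checkpoint name) are illustrative assumptions, not part of bitsandbytes or Transformers.

```py
# Back-of-the-envelope weight memory at different precisions.
# Assumption: only weight storage is counted; activations, the KV cache, and
# quantization metadata (scales, absmax constants) are ignored.

BITS_PER_BYTE = 8
GIB = 1024 ** 3


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in GiB (hypothetical helper)."""
    return num_params * bits_per_param / BITS_PER_BYTE / GIB


# Hypothetical parameter count, rounded from the name "bigscience/bloom-1b7".
NUM_PARAMS = 1_700_000_000

fp16 = weight_memory_gib(NUM_PARAMS, 16)  # 16-bit baseline
int8 = weight_memory_gib(NUM_PARAMS, 8)   # load_in_8bit=True
nf4 = weight_memory_gib(NUM_PARAMS, 4)    # load_in_4bit=True

print(f"16-bit: {fp16:.2f} GiB")
print(f" 8-bit: {int8:.2f} GiB ({fp16 / int8:.0f}x smaller)")
print(f" 4-bit: {nf4:.2f} GiB ({fp16 / nf4:.0f}x smaller)")
```

Measured footprints will differ somewhat, since some modules (the layer norms, for example) are typically kept in higher precision.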