[docs] Redesign (#31757)
* toctree * not-doctested.txt * collapse sections * feedback * update * rewrite get started sections * fixes * fix * loading models * fix * customize models * share * fix link * contribute part 1 * contribute pt 2 * fix toctree * tokenization pt 1 * Add new model (#32615) * v1 - working version * fix * fix * fix * fix * rename to correct name * fix title * fixup * rename files * fix * add copied from on tests * rename to `FalconMamba` everywhere and fix bugs * fix quantization + accelerate * fix copies * add `torch.compile` support * fix tests * fix tests and add slow tests * copies on config * merge the latest changes * fix tests * add few lines about instruct * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix * fix tests --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * "to be not" -> "not to be" (#32636) * "to be not" -> "not to be" * Update sam.md * Update trainer.py * Update modeling_utils.py * Update test_modeling_utils.py * Update test_modeling_utils.py * fix hfoption tag * tokenization pt. 
2 * image processor * fix toctree * backbones * feature extractor * fix file name * processor * update not-doctested * update * make style * fix toctree * revision * make fixup * fix toctree * fix * make style * fix hfoption tag * pipeline * pipeline gradio * pipeline web server * add pipeline * fix toctree * not-doctested * prompting * llm optims * fix toctree * fixes * cache * text generation * fix * chat pipeline * chat stuff * xla * torch.compile * cpu inference * toctree * gpu inference * agents and tools * gguf/tiktoken * finetune * toctree * trainer * trainer pt 2 * optims * optimizers * accelerate * parallelism * fsdp * update * distributed cpu * hardware training * gpu training * gpu training 2 * peft * distrib debug * deepspeed 1 * deepspeed 2 * chat toctree * quant pt 1 * quant pt 2 * fix toctree * fix * fix * quant pt 3 * quant pt 4 * serialization * torchscript * scripts * tpu * review * model addition timeline * modular * more reviews * reviews * fix toctree * reviews reviews * continue reviews * more reviews * modular transformers * more review * zamba2 * fix * all frameworks * pytorch * supported model frameworks * flashattention * rm check_table * not-doctested.txt * rm check_support_list.py * feedback * updates/feedback * review * feedback * fix * update * feedback * updates * update --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
@@ -16,19 +16,17 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# AQLM
|
||||
|
||||
> [!TIP]
|
||||
> Try AQLM on [Google Colab](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing)!
|
||||
Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
|
||||
|
||||
Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
|
||||
AQLM also supports fine-tuning with [LoRA](https://huggingface.co/docs/peft/package_reference/lora) with the [PEFT](https://huggingface.co/docs/peft) library, and is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for even faster inference and training.
|
||||
|
||||
Run the command below to install the AQLM library with kernel support for both GPU and CPU inference and training. AQLM only works with Python 3.10+.
|
||||
|
||||
Inference support for AQLM is realised in the `aqlm` library. Make sure to install it to run the models (note aqlm works only with python>=3.10):
|
||||
```bash
|
||||
pip install "aqlm[gpu,cpu]"
|
||||
```
|
||||
|
||||
The library provides efficient kernels for both GPU and CPU inference and training.
|
||||
|
||||
The instructions on how to quantize models yourself, as well as all the relevant code can be found in the corresponding GitHub [repository](https://github.com/Vahe1994/AQLM). To run AQLM models simply load a model that has been quantized with AQLM:
|
||||
Load an AQLM-quantized model with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
@@ -38,20 +36,21 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
torch_dtype="auto",
|
||||
device_map="auto"
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")
|
||||
```
|
||||
|
||||
## PEFT
|
||||
## Configurations
|
||||
|
||||
Starting with version `aqlm 1.0.2`, AQLM supports Parameter-Efficient Fine-Tuning in the form of [LoRA](https://huggingface.co/docs/peft/package_reference/lora) integrated into the [PEFT](https://huggingface.co/blog/peft) library.
|
||||
AQLM quantization setups vary mainly in the number of codebooks used, as well as codebook sizes in bits. The most popular setups and supported inference kernels are shown below.
|
||||
|
||||
## AQLM configurations
|
||||
|
||||
AQLM quantization setups vary mainly on the number of codebooks used as well as codebook sizes in bits. The most popular setups, as well as inference kernels they support are:
|
||||
|
||||
| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|
||||
|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|
|
||||
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
|
||||
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
|
||||
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
|
||||
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |
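As a rough guide to the `KxN` notation, a setup stores K codes of N bits for each group of weights, so the code storage per weight is about K·N divided by the group size. The sketch below assumes a group size of 8 (the lower end of the 8-16 range mentioned earlier) and ignores the storage needed for the codebooks and scales. `approx_bits_per_weight` is a hypothetical helper written for illustration; it is not part of the `aqlm` library.

```python
def approx_bits_per_weight(num_codebooks: int, codebook_bits: int, group_size: int = 8) -> float:
    """Approximate code storage per weight for a KxN AQLM setup.

    Assumption: each group of `group_size` weights is encoded by
    `num_codebooks` codes of `codebook_bits` bits each. Codebook and
    scale storage is ignored, so real checkpoints are somewhat larger.
    """
    return num_codebooks * codebook_bits / group_size


# Compare the two CUDA setups from the table.
for k, n in [(1, 16), (2, 8)]:
    print(f"{k}x{n}: ~{approx_bits_per_weight(k, n):.1f} bits per weight")
```

Under these assumptions, the 1x16 and 2x8 setups use the same code storage per weight, which is consistent with the `2Bit-1x16` naming of the checkpoint loaded earlier.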
|
||||
|
||||
## Resources
|
||||
|
||||
Run the AQLM demo [notebook](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing) for more examples of how to quantize a model, push a quantized model to the Hub, and more.
|
||||
|
||||
For more example demo notebooks, visit the AQLM [repository](https://github.com/Vahe1994/AQLM).
|
||||
|
||||
@@ -16,17 +16,11 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# AWQ
|
||||
|
||||
<Tip>
|
||||
|
||||
Try AWQ quantization with this [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)!
|
||||
|
||||
</Tip>
|
||||
|
||||
[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) doesn't quantize all the weights in a model, and instead, it preserves a small percentage of weights that are important for LLM performance. This significantly reduces quantization loss such that you can run models in 4-bit precision without experiencing any performance degradation.
|
||||
[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) preserves a small fraction of the weights that are important for LLM performance to compress a model to 4-bits with minimal performance degradation.
|
||||
|
||||
There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.
|
||||
|
||||
Make sure you have autoawq installed:
|
||||
Run the command below to install autoawq.
|
||||
|
||||
```bash
|
||||
pip install autoawq
|
||||
@@ -34,7 +28,7 @@ pip install autoawq
|
||||
> [!WARNING]
|
||||
> AutoAWQ downgrades Transformers to version 4.47.1. If you want to do inference with AutoAWQ, you may need to reinstall your preferred version of Transformers after installing AutoAWQ.
|
||||
|
||||
AWQ-quantized models can be identified by checking the `quantization_config` attribute in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file:
|
||||
Identify an AWQ-quantized model by checking the `quant_method` key in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file.
|
||||
|
||||
```json
|
||||
{
|
||||
@@ -55,63 +49,60 @@ AWQ-quantized models can be identified by checking the `quantization_config` att
|
||||
}
|
||||
```
|
||||
|
||||
A quantized model is loaded with the [`~PreTrainedModel.from_pretrained`] method. If you loaded your model on the CPU, make sure to move it to a GPU device first. Use the `device_map` parameter to specify where to place the model:
|
||||
Load the AWQ-quantized model with [`~PreTrainedModel.from_pretrained`]. This automatically sets the other weights to fp16 by default for performance reasons. Use the `torch_dtype` parameter to load these other weights in a different format.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model_id = "TheBloke/zephyr-7B-alpha-AWQ"
|
||||
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
|
||||
```
|
||||
|
||||
Loading an AWQ-quantized model automatically sets other weights to fp16 by default for performance reasons. If you want to load these other weights in a different format, use the `torch_dtype` parameter:
|
||||
If the model is loaded on the CPU, use the `device_map` parameter to move it to a GPU.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
import torch
|
||||
|
||||
model_id = "TheBloke/zephyr-7B-alpha-AWQ"
|
||||
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"TheBloke/zephyr-7B-alpha-AWQ",
|
||||
torch_dtype=torch.float32,
|
||||
device_map="cuda:0"
|
||||
)
|
||||
```
|
||||
|
||||
AWQ quantization can also be combined with [FlashAttention-2](../perf_infer_gpu_one#flashattention-2) to further accelerate inference:
|
||||
Use `attn_implementation` to enable [FlashAttention2](../perf_infer_gpu_one#flashattention-2) to further accelerate inference.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"TheBloke/zephyr-7B-alpha-AWQ",
|
||||
attn_implementation="flash_attention_2",
|
||||
device_map="cuda:0"
|
||||
)
|
||||
```
|
||||
|
||||
## Fused modules
|
||||
|
||||
Fused modules offers improved accuracy and performance and it is supported out-of-the-box for AWQ modules for [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures.
|
||||
Fused modules offer improved accuracy and performance. They are supported out-of-the-box for AWQ modules for [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures.
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
Fused modules cannot be combined with other optimization techniques such as FlashAttention-2.
|
||||
|
||||
</Tip>
|
||||
> [!WARNING]
|
||||
> Fused modules cannot be combined with other optimization techniques such as FlashAttention2.
|
||||
|
||||
<hfoptions id="fuse">
|
||||
<hfoption id="supported architectures">
|
||||
|
||||
To enable fused modules for supported architectures, create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True`. The `fuse_max_seq_len` parameter is the total sequence length and it should include the context length and the expected generation length. You can set it to a larger value to be safe.
|
||||
Create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True` to enable fused modules. The `fuse_max_seq_len` parameter is the total sequence length and it should include the context length and the expected generation length. Set it to a larger value to be safe.
|
||||
|
||||
For example, to fuse the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model.
|
||||
The example below fuses the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AwqConfig, AutoModelForCausalLM
|
||||
|
||||
model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"
|
||||
|
||||
quantization_config = AwqConfig(
|
||||
bits=4,
|
||||
fuse_max_seq_len=512,
|
||||
do_fuse=True,
|
||||
)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"TheBloke/Mistral-7B-OpenOrca-AWQ",
|
||||
quantization_config=quantization_config
|
||||
).to(0)
|
||||
```
|
||||
|
||||
The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1` with and without fused modules.
|
||||
@@ -156,14 +147,14 @@ The speed and throughput of fused and unfused modules were also tested with the
|
||||
</hfoption>
|
||||
<hfoption id="unsupported architectures">
|
||||
|
||||
For architectures that don't support fused modules yet, you need to create a custom fusing mapping to define which modules need to be fused with the `modules_to_fuse` parameter. For example, to fuse the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model.
|
||||
For architectures that don't support fused modules, create an [`AwqConfig`] and define a custom fusing mapping in `modules_to_fuse` to determine which modules need to be fused.
|
||||
|
||||
The example below fuses the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AwqConfig, AutoModelForCausalLM
|
||||
|
||||
model_id = "TheBloke/Yi-34B-AWQ"
|
||||
|
||||
quantization_config = AwqConfig(
|
||||
bits=4,
|
||||
fuse_max_seq_len=512,
|
||||
@@ -178,35 +169,46 @@ quantization_config = AwqConfig(
|
||||
}
|
||||
)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, trust_remote_code=True).to(0)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"TheBloke/Yi-34B-AWQ",
|
||||
quantization_config=quantization_config
|
||||
).to(0)
|
||||
```
|
||||
|
||||
The parameter `modules_to_fuse` should include:
|
||||
The parameter `modules_to_fuse` should include the following keys.
|
||||
|
||||
- `"attention"`: The names of the attention layers to fuse in the following order: query, key, value and output projection layer. If you don't want to fuse these layers, pass an empty list.
|
||||
- `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. If you don't want to fuse these layers, pass an empty list.
|
||||
- `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer in the order: (gate (dense, layer, post-attention) / up / down layers).
|
||||
- `"use_alibi"`: If your model uses ALiBi positional embedding.
|
||||
- `"num_attention_heads"`: The number of attention heads.
|
||||
- `"num_key_value_heads"`: The number of key value heads that should be used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used.
|
||||
- `"num_key_value_heads"`: The number of key value heads that should be used to implement Grouped Query Attention (GQA).
|
||||
|
||||
| parameter value | attention |
|
||||
|---|---|
|
||||
| `num_key_value_heads=num_attention_heads` | Multi-Head Attention |
|
||||
| `num_key_value_heads=1` | Multi-Query Attention |
|
||||
| `num_key_value_heads=...` | Grouped Query Attention |
|
||||
|
||||
- `"hidden_size"`: The dimension of the hidden representations.
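The relationship between `num_key_value_heads` and the attention variant can be sketched as a small check. `attention_type` is a hypothetical helper for illustration only and is not part of Transformers or autoawq; the divisibility check is an added sanity check, on the assumption that query heads are split evenly across key-value heads.

```python
def attention_type(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Name the attention variant implied by the two head counts."""
    if num_key_value_heads <= 0 or num_attention_heads % num_key_value_heads != 0:
        # Assumption: query heads are shared evenly across key-value heads.
        raise ValueError("num_attention_heads must be a positive multiple of num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "Multi-Head Attention"
    if num_key_value_heads == 1:
        return "Multi-Query Attention"
    return "Grouped Query Attention"


print(attention_type(32, 32))
print(attention_type(32, 1))
print(attention_type(32, 8))
```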
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ExLlamaV2
|
||||
|
||||
|
||||
## ExLlama-v2 support
|
||||
|
||||
Recent versions of `autoawq` support ExLlama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of `autoawq` by running:
|
||||
[ExLlamaV2](https://github.com/turboderp/exllamav2) kernels support faster prefill and decoding. Run the command below to install the latest version of autoawq with ExLlamaV2 support.
|
||||
|
||||
```bash
|
||||
pip install git+https://github.com/casper-hansen/AutoAWQ.git
|
||||
```
|
||||
|
||||
Get started by passing an `AwqConfig()` with `version="exllama"`.
|
||||
Set `version="exllama"` in [`AwqConfig`] to enable ExLlamaV2 kernels.
|
||||
|
||||
```python
|
||||
> [!TIP]
|
||||
> ExLlamaV2 is supported on AMD GPUs.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
|
||||
|
||||
@@ -217,34 +219,18 @@ model = AutoModelForCausalLM.from_pretrained(
|
||||
quantization_config=quantization_config,
|
||||
device_map="auto",
|
||||
)
|
||||
|
||||
input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
|
||||
output = model(input_ids)
|
||||
print(output.logits)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
|
||||
input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
|
||||
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=50256)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
## CPU
|
||||
|
||||
Note this feature is supported on AMD GPUs.
|
||||
|
||||
</Tip>
|
||||
|
||||
|
||||
## Intel CPU/GPU support
|
||||
|
||||
Recent versions of autoawq support Intel CPU/GPU with IPEX op optimizations. To get started, install the latest version of autoawq.
|
||||
[Intel Extension for PyTorch (IPEX)](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/) is designed to enable performance optimizations on Intel hardware. Run the command below to install the latest version of autoawq with IPEX support.
|
||||
|
||||
```bash
|
||||
pip install intel-extension-for-pytorch # for IPEX-GPU refer to https://intel.github.io/intel-extension-for-pytorch/xpu/2.5.10+xpu/
|
||||
pip install git+https://github.com/casper-hansen/AutoAWQ.git
|
||||
```
|
||||
|
||||
Get started by passing an `AwqConfig()` with `version="ipex"`.
|
||||
Set `version="ipex"` in [`AwqConfig`] to enable IPEX optimizations.
|
||||
|
||||
```python
|
||||
import torch
|
||||
@@ -258,20 +244,8 @@ model = AutoModelForCausalLM.from_pretrained(
|
||||
quantization_config=quantization_config,
|
||||
device_map=device,
|
||||
)
|
||||
|
||||
input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device=device)
|
||||
output = model(input_ids)
|
||||
print(output.logits)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ")
|
||||
input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(device)
|
||||
pad_token_id = tokenizer.eos_token_id
|
||||
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=pad_token_id)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
## Resources
|
||||
|
||||
This feature is supported on Intel CPUs/GPUs.
|
||||
|
||||
</Tip>
|
||||
Run the AWQ demo [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY#scrollTo=Wwsg6nCwoThm) for more examples of how to quantize a model, push a quantized model to the Hub, and more.
|
||||
|
||||
@@ -16,60 +16,33 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# BitNet
|
||||
|
||||
[BitNet](https://arxiv.org/abs/2402.17764) replaces traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinear with ternary (or binary in the older version) precision. The BitLinear layers introduced here quantize the weights using ternary precision (with values of -1, 0, and 1) and quantize the activations to 8-bit precision.
|
||||
|
||||
[BitNet](https://arxiv.org/abs/2402.17764) replaces traditional linear layers in Multi-Head Attention and feed-forward networks with specialized BitLinear layers. The BitLinear layers quantize the weights using ternary precision (with values of -1, 0, and 1) and quantize the activations to 8-bit precision.
|
||||
|
||||
<figure style="text-align: center;">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/1.58llm_extreme_quantization/bitlinear.png" alt="Alt Text" />
|
||||
<figcaption>The architecture of BitNet with BitLinear layers</figcaption>
|
||||
<figcaption>The architecture of BitNet with BitLinear layers.</figcaption>
|
||||
</figure>
|
||||
|
||||
During training, we start by quantizing the weights into ternary values, using symmetric per tensor quantization. First, we compute the average of the absolute values of the weight matrix and use this as a scale. We then divide the weights by the scale, round the values, constrain them between -1 and 1, and finally rescale them to continue in full precision.
|
||||
BitNet models can't be quantized on the fly. They need to be quantized during pretraining or fine-tuning because BitNet is a Quantization-Aware Training (QAT) technique. During training, the weights are quantized to ternary values with symmetric per tensor quantization.
|
||||
|
||||
$$
|
||||
scale_w = \frac{1}{\frac{1}{nm} \sum_{ij} |W_{ij}|}
|
||||
$$
|
||||
1. Compute the average of the absolute values of the weight matrix and use it as the scale.
|
||||
2. Divide the weights by the scale, round the values, constrain them between -1 and 1, and rescale them to continue in full precision.
|
||||
3. Activations are quantized to a specified bit-width (8-bit) using [absmax](https://arxiv.org/pdf/2208.07339) quantization (symmetric per channel quantization). This involves scaling the activations into a range of [−128,127].
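The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the BitLinear implementation: the function names are hypothetical, the `eps` guard against division by zero is an addition, and the activation scale is taken along the last dimension.

```python
import numpy as np


def quantize_weights_ternary(w: np.ndarray, eps: float = 1e-5):
    """Symmetric per-tensor ternary quantization (steps 1 and 2)."""
    # Step 1: the scale is the mean absolute value of the weight matrix.
    scale = max(float(np.abs(w).mean()), eps)
    # Step 2: divide by the scale, round, and constrain to {-1, 0, 1}.
    w_q = np.clip(np.round(w / scale), -1, 1)
    # Rescale so the computation can continue in full precision.
    w_dq = w_q * scale
    return w_q, w_dq, scale


def quantize_activations_absmax(x: np.ndarray, eps: float = 1e-5):
    """Symmetric 8-bit absmax quantization along the last dimension (step 3)."""
    absmax = np.maximum(np.abs(x).max(axis=-1, keepdims=True), eps)
    scale = 127.0 / absmax
    x_q = np.clip(np.round(x * scale), -128, 127)
    # Dividing by the same scale maps the integers back to the original range.
    x_dq = x_q / scale
    return x_q, x_dq


w = np.array([[0.5, -1.5], [0.1, 2.0]])
w_q, w_dq, w_scale = quantize_weights_ternary(w)
x = np.array([[1.0, -2.0, 0.5]])
x_q, x_dq = quantize_activations_absmax(x)
print(w_q)
print(x_q)
```

The weights end up with only three distinct values, while the activations keep 8 bits of resolution, so the dequantized activations stay close to their original values.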
|
||||
|
||||
$$
|
||||
W_q = \text{clamp}_{[-1,1]}(\text{round}(W*scale_w))
|
||||
$$
|
||||
Refer to this [PR](https://github.com/huggingface/nanotron/pull/180) to pretrain or fine-tune a 1.58-bit model with [Nanotron](https://github.com/huggingface/nanotron). For fine-tuning, convert the model from the Hugging Face format to the Nanotron format. Find the conversion steps in this [PR](https://github.com/huggingface/nanotron/pull/174).
|
||||
|
||||
$$
|
||||
W_{dequantized} = \frac{W_q}{scale_w}
|
||||
$$
|
||||
|
||||
Activations are then quantized to a specified bit-width (e.g., 8-bit) using [absmax](https://arxiv.org/pdf/2208.07339) quantization (symmetric per channel quantization). This involves scaling the activations into the range [−128, 127]. The quantization formula is:
|
||||
|
||||
$$
|
||||
scale_x = \frac{127}{|X|_{\text{max}, \, \text{dim}=-1}}
|
||||
$$
|
||||
|
||||
$$
|
||||
X_q = \text{clamp}_{[-128,127]}(\text{round}(X*scale_x))
|
||||
$$
|
||||
|
||||
$$
|
||||
X_{dequantized} = \frac{X_q}{scale_x}
|
||||
$$
|
||||
|
||||
To learn more about how we trained and fine-tuned BitNet models, check out the blog post [here](https://huggingface.co/blog/1_58_llm_extreme_quantization).
|
||||
|
||||
## Load a BitNet Model from the Hub
|
||||
BitNet models can't be quantized on the fly—they need to be pre-trained or fine-tuned with the quantization applied (it's a Quantization aware training technique). Once trained, these models are already quantized and available as packed versions on the hub.
|
||||
|
||||
A quantized model can be loaded as follows:
|
||||
Load a BitNet quantized model with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM
|
||||
path = "/path/to/model"
|
||||
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
|
||||
```
|
||||
## Pre-training / Fine-tuning a BitNet Model
|
||||
|
||||
If you're looking to pre-train or fine-tune your own 1.58-bit model using Nanotron, check out this [PR](https://github.com/huggingface/nanotron/pull/180). It contains everything you need to get started.
|
||||
|
||||
For fine-tuning, you'll need to convert the model from Hugging Face format to Nanotron format (which has some differences). You can find the conversion steps in this [PR](https://github.com/huggingface/nanotron/pull/174).
|
||||
|
||||
## Kernels
|
||||
|
||||
In our initial version, we chose to use `@torch.compile` to unpack the weights and perform the forward pass. It’s very straightforward to implement and delivers significant speed improvements. We plan to integrate additional optimized kernels in future versions.
|
||||
`@torch.compile` is used to unpack the weights and perform the forward pass. It’s very straightforward to implement and delivers significant speed improvements. Additional optimized kernels will be integrated in future versions.
|
||||
|
||||
## Resources
|
||||
|
||||
Read [Fine-tuning LLMs to 1.58bit: extreme quantization made easy](https://huggingface.co/blog/1_58_llm_extreme_quantization) to learn more about how BitNet models are trained and fine-tuned.
|
||||
|
||||
@@ -16,42 +16,24 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# bitsandbytes
|
||||
|
||||
[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance. 4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
|
||||
[bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) features the LLM.int8() and QLoRA quantization methods to enable accessible large language model inference and training.
|
||||
|
||||
To use bitsandbytes, make sure you have the following libraries installed:
|
||||
[LLM.int8()](https://hf.co/papers/2208.07339) is a quantization method that aims to make large language model inference more accessible without significant degradation. Unlike naive 8-bit quantization, which can result in loss of critical information and accuracy, LLM.int8() dynamically adapts to ensure sensitive components of the computation retain higher precision when needed.
|
||||
|
||||
QLoRA, or 4-bit quantization, compresses a model even further to 4-bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allow training.
|
||||
|
||||
Run the command below to install bitsandbytes.
|
||||
|
||||
```bash
|
||||
pip install --upgrade transformers accelerate bitsandbytes
|
||||
```
|
||||
|
||||
Quantize a model by passing a [`BitsAndBytesConfig`] to [`~PreTrainedModel.from_pretrained`]. This works for any model in any modality, as long as it supports [Accelerate](https://huggingface.co/docs/accelerate/index) and contains [torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers.
|
||||
|
||||
<hfoptions id="bnb">
|
||||
<hfoption id="8-bit">
|
||||
|
||||
```bash
|
||||
pip install transformers accelerate "bitsandbytes>0.37.0"
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="4-bit">
|
||||
|
||||
```bash
|
||||
pip install "bitsandbytes>=0.39.0"
|
||||
pip install --upgrade accelerate transformers
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Now you can quantize a model by passing a `BitsAndBytesConfig` to [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers.
|
||||
|
||||
<hfoptions id="bnb">
|
||||
<hfoption id="8-bit">
|
||||
|
||||
Quantizing a model in 8-bit halves the memory-usage, and for large models, set `device_map="auto"` to efficiently use the GPUs available:
|
||||
Quantizing a model in 8-bit halves the memory usage. For large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|
||||
@@ -64,7 +46,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
)
|
||||
```
|
||||
|
||||
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file.
|
||||
By default, all other modules such as [torch.nn.LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) are set to the default torch dtype. You can change the data type of these modules with the `torch_dtype` parameter. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file.
|
||||
|
||||
```py
|
||||
import torch
|
||||
@@ -80,7 +62,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
|
||||
```
|
||||
|
||||
Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with the [`~PreTrainedModel.push_to_hub`] method. The quantization config.json file is pushed first, followed by the quantized model weights.
|
||||
Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with [`~PreTrainedModel.push_to_hub`]. The quantization config.json file is pushed first, followed by the quantized model weights.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
|
||||
@@ -99,7 +81,7 @@ model.push_to_hub("bloom-560m-8bit")
|
||||
</hfoption>
|
||||
<hfoption id="4-bit">
|
||||
|
||||
Quantizing a model in 4-bit reduces your memory-usage by 4x, and for large models, set `device_map="auto"` to efficiently use the GPUs available:
|
||||
Quantizing a model in 4-bit reduces your memory usage by 4x. For large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|
||||
@@ -112,7 +94,7 @@ model_4bit = AutoModelForCausalLM.from_pretrained(
|
||||
)
|
||||
```
|
||||
|
||||
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file.
|
||||
By default, all other modules such as [torch.nn.LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file.
|
||||
|
||||
```py
|
||||
import torch
|
||||
@@ -128,24 +110,21 @@ model_4bit = AutoModelForCausalLM.from_pretrained(
|
||||
model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
|
||||
```
|
||||
|
||||
If you have `bitsandbytes>=0.41.3`, you can serialize 4-bit models and push them on Hugging Face Hub. Simply call `model.push_to_hub()` after loading it in 4-bit precision. You can also save the serialized 4-bit models locally with `model.save_pretrained()` command.
|
||||
Make sure you have the latest bitsandbytes version so you can serialize 4-bit models and push them to the Hub with [`~PreTrainedModel.push_to_hub`]. Use [`~PreTrainedModel.save_pretrained`] to save the 4-bit model locally.
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
<Tip warning={true}>
|
||||
> [!WARNING]
|
||||
> 8 and 4-bit training is only supported for training *extra* parameters.
|
||||
|
||||
Training with 8-bit and 4-bit weights is only supported for training *extra* parameters.
|
||||
|
||||
</Tip>
|
||||
|
||||
You can check your memory footprint with the `get_memory_footprint` method:
|
||||
Check your memory footprint with `get_memory_footprint`.
|
||||
|
||||
```py
|
||||
print(model.get_memory_footprint())
|
||||
```
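As a rough cross-check of the number `get_memory_footprint` reports, you can estimate the weight memory from the parameter count and the bits stored per parameter. This is a minimal sketch, not a Transformers API: the `estimate_weight_memory_gib` helper and the 560M parameter count are assumptions for illustration. It ignores modules that stay in higher precision, quantization constants, and activations, so the real footprint will be somewhat larger.

```python
def estimate_weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Estimate weight memory in GiB for a given storage width (hypothetical helper)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


# Assumed parameter count, roughly the size of a 560M-parameter model.
num_params = 560_000_000

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {estimate_weight_memory_gib(num_params, bits):.2f} GiB")
```

The 4x reduction quoted for 4-bit quantization corresponds to the ratio between the 16-bit and 4-bit estimates.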
|
||||
|
||||
Quantized models can be loaded from the [`~PreTrainedModel.from_pretrained`] method without needing to specify the `load_in_8bit` or `load_in_4bit` parameters:
|
||||
Load quantized models with [`~PreTrainedModel.from_pretrained`] without a `quantization_config`.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
@@ -153,19 +132,13 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto")
|
||||
```
|
||||
|
||||
## 8-bit (LLM.int8() algorithm)
|
||||
## LLM.int8
|
||||
|
||||
<Tip>
|
||||
|
||||
Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!
|
||||
|
||||
</Tip>
|
||||
|
||||
This section explores some of the specific features of 8-bit models, such as offloading, outlier thresholds, skipping module conversion, and finetuning.
|
||||
This section explores some of the specific features of 8-bit quantization, such as offloading, outlier thresholds, skipping module conversion, and finetuning.
|
||||
|
||||
### Offloading
|
||||
|
||||
8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in **float32**, and aren't converted to 8-bit. For example, to enable offloading for the [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) model, start by creating a [`BitsAndBytesConfig`]:
|
||||
8-bit models can offload weights between the CPU and GPU to fit very large models into memory. The weights dispatched to the CPU are stored in **float32** and aren't converted to 8-bit. For example, enable offloading for [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) through [`BitsAndBytesConfig`].
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|
||||
@@ -173,7 +146,7 @@ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|
||||
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
|
||||
```
|
||||
|
||||
Design a custom device map to fit everything on your GPU except for the `lm_head`, which you'll dispatch to the CPU:
|
||||
Design a custom device map to fit everything on your GPU except for the `lm_head`, which is dispatched to the CPU.
|
||||
|
||||
```py
|
||||
device_map = {
|
||||
@@ -185,7 +158,7 @@ device_map = {
|
||||
}
|
||||
```
|
||||
|
||||
Now load your model with the custom `device_map` and `quantization_config`:
|
||||
Now load your model with the custom `device_map` and `quantization_config`.
|
||||
|
||||
```py
|
||||
model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
@@ -200,7 +173,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
|
||||
An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).
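To make the threshold concrete, the sketch below separates the feature columns of a small hidden state matrix into outlier columns and regular columns. It is a simplified pure-Python illustration of the idea, not the bitsandbytes implementation: the `split_outlier_columns` helper and the sample values are assumptions, and it doesn't model any special handling the library applies to particular threshold values.

```python
def split_outlier_columns(hidden_states, threshold=6.0):
    """Split column indices by whether any value exceeds the threshold in magnitude.

    Hypothetical helper: outlier columns would be kept in fp16,
    while the remaining columns would be quantized to 8-bit.
    """
    num_cols = len(hidden_states[0])
    outlier_cols = [
        col
        for col in range(num_cols)
        if any(abs(row[col]) > threshold for row in hidden_states)
    ]
    regular_cols = [col for col in range(num_cols) if col not in outlier_cols]
    return outlier_cols, regular_cols


# Two token positions with four hidden features each (made-up values).
hidden_states = [
    [0.5, -1.2, 7.3, 0.1],
    [2.0, 0.3, -0.4, -3.1],
]

outlier_cols, regular_cols = split_outlier_columns(hidden_states, threshold=6.0)
print("fp16 columns:", outlier_cols)
print("int8 columns:", regular_cols)
```

Lowering the threshold moves more columns into the higher precision path, which is why a lower value can help with less stable models.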
|
||||
|
||||
To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:
|
||||
To find the best threshold for your model, experiment with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]. For example, setting the threshold to `0.0` significantly speeds up inference at the potential cost of some accuracy loss.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|
||||
@@ -208,7 +181,7 @@ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|
||||
model_id = "bigscience/bloom-1b7"
|
||||
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
llm_int8_threshold=10.0,
|
||||
llm_int8_threshold=0.0,
|
||||
llm_int8_enable_fp32_cpu_offload=True
|
||||
)
|
||||
|
||||
@@ -222,7 +195,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
|
||||
### Skip module conversion
|
||||
|
||||
For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit which can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:
|
||||
For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit because it can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`].
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
|
||||
@@ -243,22 +216,15 @@ model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
|
||||
### Finetuning
|
||||
|
||||
With the [PEFT](https://github.com/huggingface/peft) library, you can finetune large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it'll automatically load your model on a GPU. However, you can still customize the device map with the `device_map` parameter if you want to (`device_map="auto"` should only be used for inference).
|
||||
The [PEFT](https://github.com/huggingface/peft) library supports fine-tuning large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it automatically loads your model on a GPU. However, you can still customize the device map with the `device_map` parameter (`device_map="auto"` should only be used for inference).
|
||||
|
||||
## 4-bit (QLoRA algorithm)
|
||||
|
||||
<Tip>
|
||||
|
||||
Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about it's details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
|
||||
|
||||
</Tip>
|
||||
|
||||
This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
|
||||
## QLoRA
|
||||
|
||||
This section explores some of the specific features of 4-bit quantization, such as changing the compute data type, the Normal Float 4 (NF4) data type, and nested quantization.
|
||||
|
||||
### Compute data type
|
||||
|
||||
To speedup computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:
|
||||
Change the compute data type from float32 (the default value) to bf16 with the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`] to speed up computation.
|
||||
|
||||
```py
|
||||
import torch
|
||||
@@ -269,7 +235,7 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dty
|
||||
|
||||
### Normal Float 4 (NF4)
|
||||
|
||||
NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:
|
||||
NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models.
|
||||
|
||||
```py
|
||||
from transformers import BitsAndBytesConfig
|
||||
@@ -286,7 +252,7 @@ For inference, the `bnb_4bit_quant_type` does not have a huge impact on performa
|
||||
|
||||
### Nested quantization
|
||||
|
||||
Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps.
|
||||
Nested quantization can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and gradient accumulation with 4 steps.
|
||||
|
||||
```py
|
||||
from transformers import BitsAndBytesConfig
|
||||
@@ -299,22 +265,19 @@ double_quant_config = BitsAndBytesConfig(
|
||||
model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf", torch_dtype="auto", quantization_config=double_quant_config)
|
||||
```
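To get a feel for what 0.4 bits/parameter means in practice, the arithmetic can be worked out directly. This is a back-of-the-envelope sketch: the `nested_quantization_savings_gb` helper is hypothetical and the 13 billion parameter count is an assumption standing in for a 13B model.

```python
def nested_quantization_savings_gb(num_params: int, bits_saved_per_param: float = 0.4) -> float:
    """Estimate memory saved by nested quantization in GB (hypothetical helper)."""
    bytes_saved = num_params * bits_saved_per_param / 8
    return bytes_saved / 1e9


# Assumed parameter count for a 13B model.
savings = nested_quantization_savings_gb(13_000_000_000)
print(f"Estimated savings: {savings:.2f} GB")
```

For a 13B model this works out to roughly 0.65 GB, which can matter when the model barely fits on a 16GB GPU.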
|
||||
|
||||
## Dequantizing `bitsandbytes` models
|
||||
## Dequantizing bitsandbytes models
|
||||
|
||||
Once quantized, you can dequantize the model to the original precision but this might result in a small quality loss of the model. Make sure you have enough GPU RAM to fit the dequantized model.
|
||||
Once quantized, you can [`~PreTrainedModel.dequantize`] a model to the original precision but this may result in some quality loss. Make sure you have enough GPU memory to fit the dequantized model.
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
|
||||
|
||||
model_id = "facebook/opt-125m"
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", quantization_config=BitsAndBytesConfig(load_in_4bit=True))
|
||||
model.dequantize()
|
||||
```
|
||||
|
||||
text = tokenizer("Hello my name is", return_tensors="pt").to(0)
|
||||
## Resources
|
||||
|
||||
out = model.generate(**text)
|
||||
print(tokenizer.decode(out[0]))
|
||||
```
|
||||
Learn more about the details of 8-bit quantization in [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration).
|
||||
|
||||
Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about its details in [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
|
||||
|
||||
@@ -13,98 +13,61 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
# Compressed Tensors
|
||||
|
||||
The [`compressed-tensors`](https://github.com/neuralmagic/compressed-tensors) library provides a versatile and efficient way to store and manage compressed model checkpoints. This library supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations like GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.
|
||||
# compressed-tensors
|
||||
|
||||
Some of the supported formats include:
|
||||
1. `dense`
|
||||
2. `int-quantized` ([sample](https://huggingface.co/nm-testing/tinyllama-w8a8-compressed-hf-quantizer)): INT8 quantized models
|
||||
3. `float-quantized` ([sample](https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-fp8-hf_compat)): FP8 quantized models; currently support E4M3
|
||||
4. `pack-quantized` ([sample](https://huggingface.co/nm-testing/tinyllama-w4a16-compressed-hf-quantizer)): INT4 or INT8 weight-quantized models, packed into INT32. For INT4, the weights have an INT4 range but are stored as INT8 and then packed into INT32.
|
||||
[compressed-tensors](https://github.com/neuralmagic/compressed-tensors) extends [safetensors](https://github.com/huggingface/safetensors) files with compressed tensor data types to provide a unified checkpoint format for storing and loading various quantization and sparsity formats such as dense, int-quantized (int8), float-quantized (fp8), and pack-quantized (int4 or int8 weight-quantized packed into int32).
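To illustrate what packing int4 weights into int32 means, the sketch below stores eight signed 4-bit values in a single 32-bit integer and recovers them again. It is a conceptual example only: the `pack_int4` and `unpack_int4` helpers, the bit order (first value in the lowest bits), and the two's complement sign handling are assumptions, and the actual layout used by compressed-tensors may differ.

```python
def pack_int4(values):
    """Pack eight signed 4-bit integers (-8..7) into one unsigned 32-bit integer."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    packed = 0
    for i, value in enumerate(values):
        if not -8 <= value <= 7:
            raise ValueError(f"{value} does not fit in a signed 4-bit range")
        # Keep the low 4 bits (two's complement) and shift them into slot i.
        packed |= (value & 0xF) << (4 * i)
    return packed


def unpack_int4(packed):
    """Recover the eight signed 4-bit integers from a packed 32-bit integer."""
    values = []
    for i in range(8):
        nibble = (packed >> (4 * i)) & 0xF
        # Nibbles 8..15 represent the negative values -8..-1.
        values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


weights = [3, -2, 7, -8, 0, 1, -1, 5]
packed = pack_int4(weights)
print(hex(packed))
print(unpack_int4(packed))
```

Packing this way stores eight weights in the space of one 32-bit element, which is where the memory savings of the pack-quantized format come from.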
|
||||
|
||||
Compressed models can be easily created using [llm-compressor](https://github.com/vllm-project/llm-compressor).
|
||||
Alternatively models can be created independently and serialized with a compressed tensors config.
|
||||
compressed-tensors supports fine-tuning with [PEFT](https://huggingface.co/docs/peft) and includes the following features as well.
|
||||
|
||||
To find existing models on the Hugging Face Model Hub, search for the [`compressed-tensors` tag](https://huggingface.co/models?other=compressed-tensors).
|
||||
- fp8, int4, int8 weight and activation precisions.
|
||||
- Quantization scales and zero-points strategies for [tensor, channel, group, block, token](https://github.com/neuralmagic/compressed-tensors/blob/83b2e7a969d70606421a76b9a3d112646077c8de/src/compressed_tensors/quantization/quant_args.py#L43-L52).
|
||||
- Dynamic per-token activation quantization (or any static strategy).
|
||||
- Weight sparsity (unstructured or semi-structured like 2:4) can be composed with quantization for extreme compression.
|
||||
- Quantization of arbitrary modules, not just [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) modules.
|
||||
- Targeted support for specific modules by name or class.
|
||||
|
||||
#### Features:
|
||||
- Weight and activation precisions: FP8, INT4, INT8 (for Q/DQ arbitrary precision is allowed for INT)
|
||||
- Quantization scales and zero-points strategies: [tensor, channel, group, block, token](https://github.com/neuralmagic/compressed-tensors/blob/83b2e7a969d70606421a76b9a3d112646077c8de/src/compressed_tensors/quantization/quant_args.py#L43-L52)
|
||||
- Dynamic per-token activation quantization (or any static strategy)
|
||||
- Sparsity in weights (unstructured or semi-structured like 2:4) can be composed with quantization for extreme compression
|
||||
- Supports quantization of arbitrary modules, not just Linear modules
|
||||
- Targeted support or ignoring of modules by name or class
|
||||
Install compressed-tensors from [PyPI](https://pypi.org/project/compressed-tensors) to get the latest stable release (recommended) or install it from source to get the latest features.
|
||||
|
||||
## Installation
|
||||
<hfoptions id="install">
|
||||
<hfoption id="PyPI">
|
||||
|
||||
It is recommended to install stable releases of compressed-tensors from [PyPI](https://pypi.org/project/compressed-tensors):
|
||||
```bash
|
||||
pip install compressed-tensors
|
||||
```
|
||||
|
||||
Developers who want to experiment with the latest features can also install the package from source:
|
||||
</hfoption>
|
||||
<hfoption id="source code">
|
||||
|
||||
```bash
|
||||
git clone https://github.com/neuralmagic/compressed-tensors
|
||||
cd compressed-tensors
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
## Quickstart Model Load
|
||||
Quantized models can be easily loaded for inference as shown below. Only models that have already been quantized can be loaded at the moment. To quantize a model into the compressed-tensors format see [llm-compressor](https://github.com/vllm-project/llm-compressor).
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Search using the compressed-tensors [tag](https://huggingface.co/models?other=compressed-tensors) to find a compatible model on the Hugging Face Hub.
|
||||
|
||||
Only models that have already been quantized can be loaded at the moment, and once a model is loaded, it cannot be saved. To quantize a model into the compressed-tensors format, see [llm-compressor](https://github.com/vllm-project/llm-compressor). Alternatively, models can be created independently and serialized with a compressed-tensors config.
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
# Load the model in compressed-tensors format
|
||||
ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")
|
||||
ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf", device_map="auto")
|
||||
|
||||
# Measure memory usage
|
||||
# measure memory usage
|
||||
mem_params = sum([param.nelement()*param.element_size() for param in ct_model.parameters()])
|
||||
print(f"{mem_params/2**30:.4f} GB")
|
||||
# 8.4575 GB
|
||||
```
|
||||
|
||||
We can see just above that the compressed-tensors FP8 checkpoint of Llama 3.1 8B is able to be loaded for inference using half of the memory of the unquantized reference checkpoint.
|
||||
## Model checkpoint
|
||||
|
||||
## Sample Use Cases - Load and run an FP8 model
|
||||
A compressed-tensors model is defined through its configuration entry. The following example is taken from the [nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf](https://huggingface.co/nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf/blob/main/config.json) `config.json` file.
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
prompt = [
|
||||
"Hello, my name is",
|
||||
"The capital of France is",
|
||||
"The future of AI is"
|
||||
]
|
||||
|
||||
model_name = "nm-testing/Meta-Llama-3-8B-Instruct-fp8-hf_compat"
|
||||
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
|
||||
inputs = tokenizer(prompt, return_tensors="pt")
|
||||
generated_ids = quantized_model.generate(**inputs, max_length=50, do_sample=False)
|
||||
outputs = tokenizer.batch_decode(generated_ids)
|
||||
|
||||
print(outputs)
|
||||
|
||||
"""
|
||||
['<|begin_of_text|>Hello, my name is [Name]. I am a [Your Profession/Student] and I am here to learn about the [Course/Program] at [University/Institution]. I am excited to be here and I am looking forward to', '<|begin_of_text|>The capital of France is Paris, which is located in the north-central part of the country. Paris is the most populous city in France and is known for its stunning architecture, art museums, fashion, and romantic atmosphere. The city is home to', "<|begin_of_text|>The future of AI is here, and it's already changing the way we live and work. From virtual assistants to self-driving cars, AI is transforming industries and revolutionizing the way we interact with technology. But what does the future of AI hold"]
|
||||
"""
|
||||
|
||||
```
|
||||
|
||||
The above shows a quick example for running generation using a `compressed-tensors`
|
||||
model. Currently, once loaded the model cannot be saved.
|
||||
|
||||
## Deep dive into a compressed-tensors model checkpoint
|
||||
|
||||
In this example we will examine how the compressed-tensors model nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf is defined through its configuration entry and see how this translates to the loaded model representation.
|
||||
|
||||
First, let us look at the [`quantization_config` of the model](https://huggingface.co/nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf/blob/main/config.json). At a glance it looks overwhelming with the number of entries but this is because compressed-tensors is a format that allows for flexible expression both during and after model compression.
|
||||
|
||||
In practice for checkpoint loading and inference the configuration can be simplified to not include all the default or empty entries, so we will do that here to focus on what compression is actually represented.
|
||||
There are a lot of entries to allow for flexible expression both during and after compression, but the entries for loading and inference can be simplified to focus on just a few key entries.
|
||||
|
||||
```yaml
|
||||
"quantization_config": {
|
||||
@@ -130,9 +93,9 @@ In practice for checkpoint loading and inference the configuration can be simpli
|
||||
},
|
||||
```
|
||||
|
||||
We can see from the above configuration that it is specifying one config group that includes weight and activation quantization to FP8 with a static per-tensor strategy. It is also worth noting that in the `ignore` list there is an entry to skip quantization of the `lm_head` module, so that module should be untouched in the checkpoint.
|
||||
The config file specifies the quantization of a config group (`group_0`), which includes weight and activation quantization to fp8 with a static per-tensor strategy. The `lm_head` module is unquantized as shown in the `ignore` key.
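A per-tensor strategy means a single scale is shared by every value in a tensor. The sketch below shows the idea with symmetric int8 quantization as a stand-in, because fp8 arithmetic isn't available in plain Python. The `quantize_per_tensor` and `dequantize_per_tensor` helpers and the sample weights are assumptions for illustration and are not part of compressed-tensors.

```python
def quantize_per_tensor(values, num_bits=8):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    # Avoid division by zero for an all-zero tensor.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_per_tensor(quantized, scale):
    """Map the integer values back to floats with the shared scale."""
    return [q * scale for q in quantized]


# Made-up weights for a tiny tensor.
weights = [0.5, -1.0, 0.25, 0.75]
quantized, scale = quantize_per_tensor(weights)
restored = dequantize_per_tensor(quantized, scale)

print("scale:", scale)
print("quantized:", quantized)
print("restored:", restored)
```

The single `scale` value plays the same role as a per-tensor weight scale stored next to the quantized weights in a checkpoint.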
|
||||
|
||||
To see the result of the configuration in practice, we can simply use the [safetensors viewer](https://huggingface.co/nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf?show_file_info=model.safetensors.index.json) on the model card to see the quantized weights, input_scale, and weight_scale for all of the Linear modules in the first model layer (and so on for the rest of the layers).
|
||||
For a more detailed look at the model weights, use the [safetensors viewer](https://huggingface.co/nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf?show_file_info=model.safetensors.index.json) on the model card to see the quantized weights, input scale, and weight scale for all [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) modules.
|
||||
|
||||
| Tensors | Shape | Precision |
|
||||
| ------- | ----- | --------- |
|
||||
@@ -160,7 +123,7 @@ model.layers.0.self_attn.v_proj.input_scale | [1] | BF16
|
||||
model.layers.0.self_attn.v_proj.weight | [1 024, 4 096] | F8_E4M3
|
||||
model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16
|
||||
|
||||
When we load the model with the compressed-tensors HFQuantizer integration, we can see that all of the Linear modules that are specified within the quantization configuration have been replaced by `CompressedLinear` modules that manage the compressed weights and forward pass for inference. Note that the `lm_head` mentioned before in the ignore list is still kept as an unquantized Linear module.
|
||||
When loading a compressed-tensors model with the [`~quantizers.HfQuantizer`] integration, all the [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) modules specified in the quantization config are replaced by [CompressedLinear](https://github.com/neuralmagic/compressed-tensors/blob/975cb223b19fcac2b98a4271d17668462d4d6e1d/src/compressed_tensors/linear/compressed_linear.py#L30) modules that manage the compressed weights and forward pass for inference. The `lm_head` module is still kept as an unquantized nn.Linear module.
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
@@ -14,56 +14,58 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Contribute new quantization method
|
||||
# Contribute
|
||||
|
||||
Transformers supports and integrates many quantization methods such as QLoRA, GPTQ, LLM.int8, and AWQ. However, there are other quantization approaches that are not yet integrated. To make adding and using these quantization methods with Transformers models easier, you should use the [`HfQuantizer`] class. The [`HfQuantizer`] is designed as an internal helper class for adding a quantization method instead of something you apply to every PyTorch module.
|
||||
Transformers supports many quantization methods such as QLoRA, GPTQ, LLM.int8, and AWQ. However, there are still many more quantization approaches that haven't been integrated yet. To make adding and using these quantization methods with Transformers easier, use the [`~quantizers.HfQuantizer`] class. [`~quantizers.HfQuantizer`] is designed to be an internal helper class for adding a quantization method instead of something applied to every PyTorch module.
|
||||
|
||||
This guide will show you how to integrate a new quantization method with the [`HfQuantizer`] class.
|
||||
This guide will show you how to integrate a new quantization method with [`~quantizers.HfQuantizer`].
|
||||
|
||||
## Requirements
|
||||
|
||||
Before integrating a new quantization method into Transformers, ensure the method you are trying to add meets the following prerequisites. Only quantization methods that can be run with PyTorch modules are currently supported.
|
||||
Before integrating a new quantization method into Transformers, ensure the method meets the following requirements. Only quantization methods that can be run with PyTorch modules are supported.
|
||||
|
||||
- The quantization method is available through a Python package that is pip-installable by anyone (it is also fine if you can only install the package from source). Ideally, pre-compiled kernels are included in the pip package.
|
||||
- The method can run on commonly-used hardware (CPU, GPU, ...).
|
||||
- The method is wrapped in a `nn.Module` (e.g., `Linear8bitLt`, `Linear4bit`), and the quantized linear layer should have the following definition:
|
||||
- The quantization method is available through a Python package that is pip-installable (it is also fine if you can only install the package from source). Ideally, pre-compiled kernels are included in the pip package.
|
||||
- The method can run on commonly-used hardware (CPU, GPU, etc.).
|
||||
- The method is wrapped in a [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) ([`~bitsandbytes.nn.Linear8bitLt`], [`~bitsandbytes.nn.Linear4bit`]), and the quantized linear layer should have the following definition.
|
||||
|
||||
```py
|
||||
class Linear4bit(nn.Module):
|
||||
def __init__(self, ...):
|
||||
...
|
||||
|
||||
def forward(self, x):
|
||||
return my_4bit_kernel(x, self.weight, self.bias)
|
||||
```
|
||||
```py
|
||||
class Linear4bit(nn.Module):
|
||||
def __init__(self, ...):
|
||||
...
|
||||
|
||||
def forward(self, x):
|
||||
return my_4bit_kernel(x, self.weight, self.bias)
|
||||
```
|
||||
|
||||
This way, Transformers models can be easily quantized by replacing some instances of `nn.Linear` with a target class.
|
||||
This way, Transformers models are easily quantized by replacing instances of [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) with a target class.
|
||||
|
||||
- The quantization method should be serializable. You can save the quantized weights locally or push them to the Hub.
|
||||
- Make sure the package that contains the quantization kernels/primitive is stable (no frequent breaking changes).
|
||||
- Make sure the package containing the quantization kernels/primitive is stable (no frequent breaking changes).
|
||||
|
||||
For some quantization methods, they may require "pre-quantizing" the models through data calibration (e.g., AWQ). In this case, we prefer to only support inference in Transformers and let the third-party library maintained by the ML community deal with the model quantization itself.
|
||||
Some quantization methods may require "pre-quantizing" the model through data calibration (for example, AWQ). In this case, we prefer to only support inference in Transformers and let the third-party library maintained by the ML community handle the model quantization itself.
|
||||
|
||||
## Build a new HFQuantizer class
|
||||
## Create a new HfQuantizer class
|
||||
|
||||
1. Create a new quantization config class inside [src/transformers/utils/quantization_config.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/quantization_config.py) and make sure to expose the new quantization config inside Transformers main `init` by adding it to the [`_import_structure`](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py#L1088) object of [src/transformers/__init__.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py).
|
||||
1. Create a new quantization config class inside [src/transformers/utils/quantization_config.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/quantization_config.py). Add the new quantization config to the [_import_structure](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py#L1088) inside Transformers' [src/transformers/__init__.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py) file.
|
||||
|
||||
2. Create a new file inside [src/transformers/quantizers/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers) named `quantizer_your_method.py`, and make it inherit from [src/transformers/quantizers/base.py::HfQuantizer](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/base.py#L28). Make sure to add the new quantizer and quantization config in the quantization auto-mapping in [src/transformers/quantizers/auto.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/auto.py).
|
||||
2. Create a new file inside [src/transformers/quantizers/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers) named `quantizer_your_method.py`, and make it inherit from [`~quantizers.HfQuantizer`]. Make sure to add the new quantizer and quantization config in the quantization auto-mapping in [src/transformers/quantizers/auto.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/auto.py).
|
||||
|
||||
3. Define the following class attributes/property methods for your quantization method:
|
||||
3. Define the following class attributes and property methods for your quantization method.
|
||||
|
||||
* `requires_calibration`: Whether the quantization method requires a data calibration process. If set to `True`, you can only support inference (with quantized weights) and not inference and quantization.
|
||||
* `required_packages`: A list of strings of the required packages to use the quantized weights. You might need to define some new utility methods such as `is_auto_awq_available` in [transformers/src/utils/import_utils.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/import_utils.py).
|
||||
* `requires_parameters_quantization`: Only required if your quantization method requires extra attention to the underlying `nn.Parameter` object. For example, bitsandbytes uses `Params4bit` and `Int8Param`, which requires some extra attention when quantizing the model. Most of the recent quantization method packs int2/int4 weights inside `torch.uint8` weights, so this flag should not be really required (set to `False` by default).
|
||||
* `is_serializable`: A property method to determine whether the method is serializable or not.
|
||||
* `is_trainable`: A property method to determine whether you can fine-tune models on top of the quantization method (with or without PEFT approaches).
|
||||
- `requires_calibration`: Whether the quantization method requires a data calibration process. If set to `True`, you can only support inference (with quantized weights) and not inference and quantization.
|
||||
- `required_packages`: A list of strings of the required packages to use the quantized weights. You might need to define some new utility methods such as `is_auto_awq_available` in [src/transformers/utils/import_utils.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/import_utils.py).
|
||||
- `requires_parameters_quantization`: Only required if your quantization method requires extra attention to the underlying [nn.Parameter](https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html) object. For example, bitsandbytes uses [`~bitsandbytes.nn.Params4bit`] and [`~bitsandbytes.nn.Int8Params`], which require some extra attention when quantizing the model. Most recent quantization methods pack int2 and int4 weights inside [torch.uint8](https://pytorch.org/docs/stable/tensors.html) weights, so this flag is rarely needed (it is set to `False` by default).
|
||||
- `is_serializable`: A property method to determine whether the method is serializable or not.
|
||||
- `is_trainable`: A property method to determine whether you can fine-tune models on top of the quantization method (with or without PEFT approaches).
|
||||
|
||||
4. Write the `validate_environment` and `update_torch_dtype` methods. These methods are called before creating the quantized model to ensure users use the right configuration. You can have a look at how this is done on other quantizers.
|
||||
4. Write the `validate_environment` and `update_torch_dtype` methods. These methods are called before creating the quantized model to ensure users use the right configuration. Refer to other quantizers for an example of how it is implemented.
|
||||
|
||||
5. Write the `_process_model_before_weight_loading` method. In Transformers, the quantized models are initialized first on the `"meta"` device before loading the weights. This means the `_process_model_before_weight_loading` method takes care of manipulating the model skeleton to replace some modules (e.g., `nn.Linear`) with the target modules (quantization modules). You can define a module replacement logic or any other utility method by creating a new file in [transformers/src/integrations/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/integrations) and exposing the relevant methods in that folder's `__init__.py` file. The best starting point would be to have a look at another quantization methods such as [quantizer_awq.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/quantizer_awq.py).
|
||||
5. Write the `_process_model_before_weight_loading` method. In Transformers, the quantized models are initialized first on the `"meta"` device before loading the weights. This means the `_process_model_before_weight_loading` method takes care of manipulating the model skeleton to replace some modules ([nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)) with the target modules (quantization modules).
|
||||
|
||||
You can define module replacement logic or any other utility method by creating a new file in [src/transformers/integrations/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/integrations) and exposing the relevant methods in that folder's `__init__.py` file. A good starting point is an existing quantization method such as [quantizer_awq.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/quantizer_awq.py).
|
||||
|
||||
6. Write the `_process_model_after_weight_loading` method. This method enables implementing additional features that require manipulating the model after loading the weights.
|
||||
|
||||
7. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization` and adding a new row in the table in `docs/source/en/quantization/overview.md`.
|
||||
7. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization`.
|
||||
|
||||
8. Add tests! You should add tests by first adding the package in our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out how it is implemented for other quantization methods.
|
||||
8. You should add tests by adding the package in our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out existing quantization methods to see how it is implemented.
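The steps above can be sketched as a class skeleton. This is only an illustration under stated assumptions, not the Transformers implementation: `HfQuantizerStandIn` is a hypothetical stand-in for `HfQuantizer` so the block runs without Transformers installed, and `YourMethodQuantizer` and the `your_method` package are placeholder names. Method signatures in the real base class may differ between releases, so check `base.py` before copying anything.

```python
import importlib.util


class HfQuantizerStandIn:
    """Hypothetical stand-in for HfQuantizer; a real quantizer inherits from HfQuantizer."""

    requires_calibration = False
    required_packages = None
    requires_parameters_quantization = False

    def __init__(self, quantization_config, **kwargs):
        self.quantization_config = quantization_config


class YourMethodQuantizer(HfQuantizerStandIn):
    # Step 3: class attributes and property methods.
    requires_calibration = False
    required_packages = ["your_method"]  # placeholder package name
    requires_parameters_quantization = False

    @property
    def is_serializable(self):
        return True

    @property
    def is_trainable(self):
        return False

    # Step 4: checks that run before the quantized model is created.
    def validate_environment(self, *args, **kwargs):
        missing = [
            name for name in self.required_packages
            if importlib.util.find_spec(name) is None
        ]
        if missing:
            raise ImportError(f"Missing required packages: {missing}")

    def update_torch_dtype(self, torch_dtype):
        # A real quantizer would return a torch.dtype; a string keeps this sketch torch-free.
        return torch_dtype if torch_dtype is not None else "float16"

    # Step 5: swap modules on the model skeleton before the weights are loaded.
    def _process_model_before_weight_loading(self, model, **kwargs):
        return model

    # Step 6: optional post-processing once the weights are loaded.
    def _process_model_after_weight_loading(self, model, **kwargs):
        return model


quantizer = YourMethodQuantizer({"bits": 8})
```

Calling `quantizer.validate_environment()` here raises an `ImportError` because the placeholder package does not exist, which is the kind of early failure this hook is meant to surface.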
|
||||
|
||||
@@ -16,32 +16,50 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# EETQ
|
||||
|
||||
The [EETQ](https://github.com/NetEase-FuXi/EETQ) library supports int8 per-channel weight-only quantization for NVIDIA GPUS. The high-performance GEMM and GEMV kernels are from FasterTransformer and TensorRT-LLM. It requires no calibration dataset and does not need to pre-quantize your model. Moreover, the accuracy degradation is negligible owing to the per-channel quantization.
|
||||
The [Easy & Efficient Quantization for Transformers (EETQ)](https://github.com/NetEase-FuXi/EETQ) library supports int8 weight-only per-channel quantization for NVIDIA GPUs. It uses high-performance GEMM and GEMV kernels from [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). The attention layer is optimized with [FlashAttention2](https://github.com/Dao-AILab/flash-attention). No calibration dataset is required, and the model doesn't need to be pre-quantized. Accuracy degradation is negligible owing to the per-channel quantization.
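To make "int8 weight-only per-channel quantization" concrete, here is a minimal pure-Python sketch of the idea. It is not EETQ's code (EETQ runs CUDA kernels); the function names are made up for illustration, and treating each row as one output channel is an assumption about the weight layout.

```python
def quantize_per_channel_int8(weight):
    """Symmetric int8 quantization with one scale per row (assumed output channel)."""
    quantized, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric int8 range.
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate float weights from int8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two channels with very different magnitudes each get their own scale.
weight = [[0.02, -0.01, 0.005], [4.0, -2.0, 1.0]]
quantized, scales = quantize_per_channel_int8(weight)
restored = dequantize(quantized, scales)
```

Because the small-magnitude row is not forced to share a scale with the large-magnitude row, its rounding error stays proportional to its own range, which is the intuition behind the low accuracy degradation of per-channel schemes.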
|
||||
|
||||
Make sure you have eetq installed from the [release page](https://github.com/NetEase-FuXi/EETQ/releases)
|
||||
```
|
||||
EETQ further supports fine-tuning with [PEFT](https://huggingface.co/docs/peft).
|
||||
|
||||
Install EETQ from the [release page](https://github.com/NetEase-FuXi/EETQ/releases) or [source code](https://github.com/NetEase-FuXi/EETQ). CUDA 11.4+ is required for EETQ.
|
||||
|
||||
<hfoptions id="install">
|
||||
<hfoption id="release page">
|
||||
|
||||
```bash
|
||||
pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl
|
||||
```
|
||||
or via the source code https://github.com/NetEase-FuXi/EETQ. EETQ requires CUDA capability <= 8.9 and >= 7.0
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="source code">
|
||||
|
||||
```bash
|
||||
git clone https://github.com/NetEase-FuXi/EETQ.git
|
||||
cd EETQ/
|
||||
git submodule update --init --recursive
|
||||
pip install .
|
||||
```
|
||||
|
||||
An unquantized model can be quantized via "from_pretrained".
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantize a model on-the-fly by defining the quantization data type in [`EetqConfig`].
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, EetqConfig
|
||||
path = "/path/to/model"
|
||||
|
||||
quantization_config = EetqConfig("int8")
|
||||
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
torch_dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
```
|
||||
|
||||
A quantized model can be saved via "saved_pretrained" and be reused again via the "from_pretrained".
|
||||
Save the quantized model with [`~PreTrainedModel.save_pretrained`] so it can be reused with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```py
|
||||
quant_path = "/path/to/save/quantized/model"
|
||||
model.save_pretrained(quant_path)
|
||||
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
|
||||
```
|
||||
```
|
||||
|
||||
@@ -14,46 +14,43 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# FBGEMM FP8
|
||||
# FBGEMM
|
||||
|
||||
With FBGEMM FP8 quantization method, you can quantize your model in FP8 (W8A8):
|
||||
- the weights will be quantized in 8bit (FP8) per channel
|
||||
- the activation will be quantized in 8bit (FP8) per token
|
||||
|
||||
It relies on the [FBGEMM](https://github.com/pytorch/FBGEMM) library which provides efficient low-precision general matrix multiplication for small batch sizes and support for accuracy-loss minimizing techniques such as row-wise quantization and outlier-aware quantization.
|
||||
[FBGEMM (Facebook GEneral Matrix Multiplication)](https://github.com/pytorch/FBGEMM) is a low-precision matrix multiplication library for small batch sizes that supports accuracy-loss minimizing techniques such as row-wise quantization and outlier-aware quantization. With FBGEMM, quantize a model's weights to 8 bits per channel and the activations to 8 bits per token (also known as fp8 or w8a8).
|
||||
|
||||
> [!TIP]
|
||||
> You need a GPU with compute capability>=9 (e.g. H100)
|
||||
> You need a GPU with [compute capability 9+](https://developer.nvidia.com/cuda-gpus#collapseOne) like an H100.
|
||||
|
||||
Before you begin, make sure the following libraries are installed with their latest version:
|
||||
Install the FBGEMM_GPU package with the command below to ensure you have the latest version.
|
||||
|
||||
```bash
|
||||
pip install --upgrade accelerate fbgemm-gpu torch
|
||||
```
|
||||
|
||||
If you are having issues with fbgemm-gpu and torch library, you might need to install the nightly release. You can follow the instruction [here](https://pytorch.org/FBGEMM/fbgemm_gpu-development/InstallationInstructions.html#fbgemm-gpu-install-libraries:~:text=found%20here.-,Install%20the%20FBGEMM_GPU%20Package,-Install%20through%20PyTorch)
|
||||
If you're having installation issues, try installing the [nightly release](https://pytorch.org/FBGEMM/fbgemm_gpu-development/InstallationInstructions.html#fbgemm-gpu-install-libraries:~:text=found%20here.-,Install%20the%20FBGEMM_GPU%20Package,-Install%20through%20PyTorch).
|
||||
|
||||
By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file to automatically load the most memory-optimal data type.
|
||||
Create a [`FbgemmFp8Config`] and pass it to [`~PreTrainedModel.from_pretrained`] to quantize a model to fp8.
|
||||
|
||||
```py
|
||||
from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer
|
||||
from transformers import FbgemmFp8Config, AutoModelForCausalLM
|
||||
|
||||
model_name = "meta-llama/Meta-Llama-3-8B"
|
||||
quantization_config = FbgemmFp8Config()
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
|
||||
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Meta-Llama-3-8B",
|
||||
torch_dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
```
|
||||
|
||||
A quantized model can be saved via "saved_pretrained" and be reused again via the "from_pretrained".
|
||||
[`~PreTrainedModel.save_pretrained`] and [`~PreTrainedModel.from_pretrained`] enable saving and loading a quantized model.
|
||||
|
||||
```py
|
||||
quant_path = "/path/to/save/quantized/model"
|
||||
quantized_model.save_pretrained(quant_path)
|
||||
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
|
||||
```
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
Read the [Open-sourcing FBGEMM for state-of-the-art server-side inference](https://engineering.fb.com/2018/11/07/ml-applications/fbgemm/) blog post for more details on FBGEMM.
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
@@ -16,27 +16,27 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# Fine-grained FP8
|
||||
|
||||
With FP8 quantization method, you can quantize your model in FP8 (W8A8):
|
||||
- the weights will be quantized in 8bit (FP8) per 2D block (e.g. weight_block_size=(128, 128)) which is inspired from the deepseek implementation
|
||||
- Activations are quantized to 8 bits (FP8) per group per token, with the group value matching that of the weights in the input channels (128 by default)
|
||||
Fine-grained FP8 quantization quantizes the weights and activations to fp8.
|
||||
|
||||
It's implemented to add support for DeepSeek-V3 and DeepSeek-R1 models, you can see the paper [here](https://arxiv.org/pdf/2412.19437), and the image below explains the quantization scheme :
|
||||
- The weights are quantized to 8-bits for each 2D block (`weight_block_size=(128, 128)`).
|
||||
- The activations are quantized to 8-bits for each group per token. The group value matches the weights in the input channel (128 by default).
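The block-wise scheme described above can be illustrated by computing one scale per 2D weight block. This is a hedged sketch rather than the Transformers kernel: it assumes the FP8 E4M3 format, whose largest finite value is 448, it only computes the scales (no FP8 rounding), and `block_scales` is a name invented for this example. A tiny `(2, 2)` block size stands in for the default `(128, 128)`.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the assumed E4M3 format


def block_scales(weight, block_size=(128, 128)):
    """Return one scale per 2D block so each scaled block fits the FP8 range."""
    rows, cols = len(weight), len(weight[0])
    block_rows, block_cols = block_size
    scales = []
    for r0 in range(0, rows, block_rows):
        scale_row = []
        for c0 in range(0, cols, block_cols):
            block_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block_rows, rows))
                for c in range(c0, min(c0 + block_cols, cols))
            )
            scale_row.append(block_max / FP8_E4M3_MAX if block_max > 0 else 1.0)
        scales.append(scale_row)
    return scales


weight = [
    [1.0, -2.0, 100.0, 50.0],
    [0.5, 4.0, -25.0, 10.0],
    [448.0, 0.0, 0.1, 0.2],
    [-1.0, 1.0, -0.4, 0.3],
]
scales = block_scales(weight, block_size=(2, 2))
```

Each block gets a scale derived from its own largest magnitude, so an outlier such as `448.0` only affects the precision of the block it lives in.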
|
||||
|
||||

|
||||
FP8 quantization enables support for [DeepSeek-V3](https://hf.co/papers/2412.19437) and DeepSeek-R1.
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/b7b3b34bf826a6423ea82ffc57ecac80c46c3c76/transformers/quantization/quantization_deepseek.png">
|
||||
</div>
|
||||
|
||||
> [!TIP]
|
||||
> You need a GPU with compute capability>=9 (e.g. H100)
|
||||
> You need a GPU with compute capability 9 or higher (such as an H100) and a PyTorch version compatible with the CUDA version of your GPU.
|
||||
|
||||
Before you begin, make sure the following libraries are installed with their latest version:
|
||||
Install Accelerate and upgrade to the latest version of PyTorch.
|
||||
|
||||
```bash
|
||||
pip install --upgrade accelerate torch
|
||||
```
|
||||
> [!TIP]
|
||||
> You need to install a torch version compatible with the cuda version of your GPU.
|
||||
|
||||
|
||||
By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file to automatically load the most memory-optimal data type.
|
||||
Create a [`FineGrainedFP8Config`] class and pass it to [`~PreTrainedModel.from_pretrained`] to quantize a model. The weights are loaded in full precision (`torch.float32`) by default regardless of the actual data type the weights are stored in. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file to automatically load the most memory-optimal data type.
|
||||
|
||||
```py
|
||||
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer
|
||||
@@ -53,7 +53,7 @@ output = quantized_model.generate(**input_ids, max_new_tokens=10)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
A quantized model can be saved via "saved_pretrained" and be reused again via the "from_pretrained".
|
||||
Use [`~PreTrainedModel.save_pretrained`] to save the quantized model and reload it with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```py
|
||||
quant_path = "/path/to/save/quantized/model"
|
||||
|
||||
@@ -16,91 +16,80 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# GPTQ
|
||||
|
||||
<Tip>
|
||||
The [GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) libraries implement the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can reduce memory usage by almost 4x because the int4 weights are dequantized in a fused kernel rather than in a GPU's global memory. Inference is also faster because a lower bitwidth takes less time to communicate.
|
||||
|
||||
Try GPTQ quantization with PEFT in this [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) and learn more about it's details in this [blog post](https://huggingface.co/blog/gptq-integration)!
|
||||
> [!WARNING]
|
||||
> AutoGPTQ is likely to be deprecated in the future due to lack of continued support for new models and features. See the [GPTQModel](#gptqmodel) section for more details.
|
||||
|
||||
</Tip>
|
||||
|
||||
Both [GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) libraries implement the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes error. These weights are quantized to int4, stored as int32 (int4 x 8) and dequantized (restored) to fp16 on the fly during inference. This can save memory by almost 4x because the int4 weights are often dequantized in a fused kernel. You can also expect a substantial speedup in inference due to lower bandwidth requirements for lower bitwidth.
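The "int32 (int4 x 8)" storage can be pictured as packing eight 4-bit values into one 32-bit word. The sketch below is illustrative only: the function names are invented, and the bit order (first value in the lowest 4 bits) is an assumption that may not match the layout a given GPTQ kernel uses.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit values (0..15) into one 32-bit word."""
    if len(values) != 8 or any(not 0 <= v <= 15 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)  # assumed order: value i occupies bits 4*i .. 4*i+3
    return word


def unpack_int4(word):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


values = [1, 2, 3, 4, 5, 6, 7, 15]
packed = pack_int4(values)
unpacked = unpack_int4(packed)
```

Eight weights share one 32-bit word instead of taking eight 16-bit values, which is where the roughly 4x saving over fp16 comes from before scales and zero points are counted.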
|
||||
|
||||
[GPTQModel](https://github.com/ModelCloud/GPTQModel) started as a maintained fork of AutoGPTQ but has since differentiated itself with the following major differences.
|
||||
|
||||
* Model support: GPTQModel continues to support all of the latest LLM models.
|
||||
* Multimodal support: GPTQModel supports accurate quantization of Qwen 2-VL and Ovis 1.6-VL image-to-text models.
|
||||
* Platform support: Linux, macOS (Apple Silicon), and Windows 11.
|
||||
* Hardware support: NVIDIA CUDA, AMD ROCm, Apple Silicon M1/MPS /CPU, Intel/AMD CPU, and Intel Datacenter Max/Arc GPUs.
|
||||
* Asymmetric support: Asymmetric quantization can potentially introduce lower quantization errors compared to symmetric quantization. However, it is not backward compatible with AutoGPTQ, and not all kernels, such as Marlin, support asymmetric quantization.
|
||||
* IPEX kernel for Intel/AMD accelerated CPU and Intel GPU (Datacenter Max/Arc GPUs) support.
|
||||
* Updated Marlin kernel from Neural Magic optimized for A100 (Ampere).
|
||||
* Updated kernels with auto-padding for legacy model support and models with non-uniform in/out-features.
|
||||
* Faster quantization, lower memory usage, and more accurate default quantization via GPTQModel quantization APIs.
|
||||
* User and developer friendly APIs.
|
||||
|
||||
|
||||
[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) will likely be deprecated in the future due the lack of continued support for new models and features.
|
||||
|
||||
Before you begin, make sure the following libraries are installed and updated to the latest release:
|
||||
Install Accelerate, Transformers and Optimum first.
|
||||
|
||||
```bash
|
||||
pip install --upgrade accelerate optimum transformers
|
||||
```
|
||||
|
||||
Then install either GPTQModel or AutoGPTQ.
|
||||
Then run the command below to install a GPTQ library.
|
||||
|
||||
<hfoptions id="install">
|
||||
<hfoption id="GPTQmodel">
|
||||
|
||||
```bash
|
||||
pip install gptqmodel --no-build-isolation
|
||||
```
|
||||
|
||||
or
|
||||
</hfoption>
|
||||
<hfoption id="AutoGPTQ">
|
||||
|
||||
```bash
|
||||
pip install auto-gptq --no-build-isolation
|
||||
```
|
||||
|
||||
To quantize a model (currently only supported for text models), you need to create a [`GPTQConfig`] class and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Create a [`GPTQConfig`] class and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
|
||||
|
||||
model_id = "facebook/opt-125m"
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
|
||||
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
You could also pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper.
|
||||
You can pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper.
|
||||
|
||||
```py
|
||||
dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
|
||||
gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
Load a model to quantize and pass the `gptq_config` to the [`~AutoModelForCausalLM.from_pretrained`] method. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization.
|
||||
Load a model to quantize and pass [`GPTQConfig`] to [`~AutoModelForCausalLM.from_pretrained`]. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization.
|
||||
|
||||
```py
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map="auto", quantization_config=gptq_config)
|
||||
```
|
||||
|
||||
If you're running out of memory because a dataset is too large, disk offloading is not supported. If this is the case, try passing the `max_memory` parameter to allocate the amount of memory to use on your device (GPU and CPU):
|
||||
If you're running out of memory because a dataset is too large (disk offloading is not supported), try passing the `max_memory` parameter to allocate the amount of memory to use on your device (GPU and CPU).
|
||||
|
||||
```py
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"facebook/opt-125m",
|
||||
device_map="auto",
|
||||
max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"},
|
||||
quantization_config=gptq_config
|
||||
)
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
> [!WARNING]
|
||||
> Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on an NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub to see whether a GPTQ-quantized version of the model already exists.
|
||||
|
||||
Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists.
|
||||
|
||||
</Tip>
|
||||
|
||||
Once your model is quantized, you can push the model and tokenizer to the Hub where it can be easily shared and accessed. Use the [`~PreTrainedModel.push_to_hub`] method to save the [`GPTQConfig`]:
|
||||
Once a model is quantized, you can use [`~PreTrainedModel.push_to_hub`] to push the model and tokenizer to the Hub where it can be easily shared and accessed. This saves the [`GPTQConfig`].
|
||||
|
||||
```py
|
||||
quantized_model.push_to_hub("opt-125m-gptq")
|
||||
tokenizer.push_to_hub("opt-125m-gptq")
|
||||
```
|
||||
|
||||
You could also save your quantized model locally with the [`~PreTrainedModel.save_pretrained`] method. If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. For example, to save the model on a CPU:
|
||||
[`~PreTrainedModel.save_pretrained`] saves a quantized model locally. If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. The example below saves the model on a CPU.
|
||||
|
||||
```py
|
||||
quantized_model.save_pretrained("opt-125m-gptq")
|
||||
@@ -111,7 +100,7 @@ quantized_model.to("cpu")
|
||||
quantized_model.save_pretrained("opt-125m-gptq")
|
||||
```
|
||||
|
||||
Reload a quantized model with the [`~PreTrainedModel.from_pretrained`] method, and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed.
|
||||
Reload a quantized model with [`~PreTrainedModel.from_pretrained`], and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM
|
||||
@@ -134,27 +123,49 @@ model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", de
|
||||
|
||||
## ExLlama
|
||||
|
||||
[ExLlama](https://github.com/turboderp/exllama) is a CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object. To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter:
|
||||
> [!WARNING]
|
||||
> Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT.
|
||||
|
||||
[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object.
|
||||
|
||||
To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter in [`GPTQConfig`].
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, GPTQConfig
|
||||
|
||||
gptq_config = GPTQConfig(bits=4, exllama_config={"version":2})
|
||||
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"{your_username}/opt-125m-gptq",
|
||||
device_map="auto",
|
||||
quantization_config=gptq_config
|
||||
)
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT.
|
||||
|
||||
</Tip>
|
||||
|
||||
The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ or GPTQModel, then you'll need to disable the ExLlama kernel. This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file.
|
||||
The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ 0.4.2+, disable the ExLlama kernel in [`GPTQConfig`]. This overwrites the attributes related to the ExLlama kernels in the quantization config of the `config.json` file.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, GPTQConfig
|
||||
|
||||
gptq_config = GPTQConfig(bits=4, use_exllama=False)
|
||||
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"{your_username}/opt-125m-gptq",
|
||||
device_map="cpu",
|
||||
quantization_config=gptq_config
|
||||
)
|
||||
```
|
||||
|
||||
## GPTQModel
|
||||
|
||||
It is recommended to use GPTQModel, originally a maintained fork of AutoGPTQ, because it has since diverged from AutoGPTQ with some significant features. GPTQModel has faster quantization, lower memory usage, and more accurate default quantization.
|
||||
|
||||
GPTQModel provides asymmetric quantization which can potentially lower quantization errors compared to symmetric quantization. It is not backward compatible with AutoGPTQ, and not all kernels (Marlin) support asymmetric quantization.
|
||||
|
||||
GPTQModel also has broader support for the latest LLM models, multimodal models (Qwen2-VL and Ovis1.6-VL), platforms (Linux, macOS, Windows 11), and hardware (AMD ROCm, Apple Silicon, Intel/AMD CPUs, and Intel Datacenter Max/Arc GPUs, etc.).
|
||||
|
||||
The Marlin kernels are also updated for A100 GPUs and other kernels are updated to include auto-padding for legacy models and models with non-uniform in/out-features.
|
||||
|
||||
## Resources
|
||||
|
||||
Run the GPTQ quantization with PEFT [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) for a hands-on experience, and read [Making LLMs lighter with AutoGPTQ and transformers](https://huggingface.co/blog/gptq-integration) to learn more about the AutoGPTQ integration.
|
||||
|
||||
@@ -16,11 +16,30 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# HIGGS
|
||||
|
||||
HIGGS is a 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper [arxiv.org/abs/2411.17525](https://arxiv.org/abs/2411.17525).
|
||||
[HIGGS](https://arxiv.org/abs/2411.17525) is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and state-of-the-art performance.
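
As a rough intuition for the Hadamard preprocessing step, the sketch below applies an orthonormal Walsh-Hadamard transform to a short weight vector containing one outlier. It is a pure-Python illustration under simplifying assumptions, not the HIGGS or FLUTE implementation; `hadamard_transform` is a hypothetical helper, and the MSE-optimal grids are not modeled here.

```python
import math


def hadamard_transform(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(values)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                a, b = out[i], out[i + step]
                out[i], out[i + step] = a + b, a - b
        step *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


# One large outlier dominates the range of the original weights.
weights = [0.1, -0.2, 0.05, 8.0, -0.1, 0.15, 0.0, -0.05]
rotated = hadamard_transform(weights)

max_before = max(abs(v) for v in weights)
max_after = max(abs(v) for v in rotated)
print("max |w| before:", max_before)
print("max |w| after: ", round(max_after, 3))

# The orthonormal transform is its own inverse, so nothing is lost.
restored = hadamard_transform(rotated)
```

The transform spreads the outlier's magnitude across all coordinates, so the rotated values occupy a narrower range and fit a fixed quantization grid more evenly.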
|
||||
|
||||
Runtime support for HIGGS is implemented through [FLUTE](https://arxiv.org/abs/2407.10960), and its [library](https://github.com/HanGuo97/flute).
|
||||
Runtime support for HIGGS is implemented through the [FLUTE](https://github.com/HanGuo97/flute) library. Only the 8B, 70B, and 405B variants of Llama 3.0 and Llama 3.1, and the 9B and 27B variants of Gemma 2 are currently supported. HIGGS doesn't currently support quantized training or backward passes in general.
|
||||
|
||||
## Quantization Example
|
||||
Run the command below to install FLUTE.
|
||||
|
||||
<hfoptions id="install">
|
||||
<hfoption id="CUDA 12.1">
|
||||
|
||||
```bash
|
||||
pip install flute-kernel
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="CUDA 11.8">
|
||||
|
||||
```bash
|
||||
pip install flute-kernel -i https://flute-ai.github.io/whl/cu12.4
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Create a [`HiggsConfig`] with the number of bits to quantize a model to.
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
|
||||
@@ -30,37 +49,32 @@ model = AutoModelForCausalLM.from_pretrained(
|
||||
quantization_config=HiggsConfig(bits=4),
|
||||
device_map="auto",
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
|
||||
|
||||
tokenizer.decode(model.generate(
|
||||
**tokenizer("Hi,", return_tensors="pt").to(model.device),
|
||||
temperature=0.5,
|
||||
top_p=0.80,
|
||||
)[0])
|
||||
```
|
||||
|
||||
## Pre-quantized models
|
||||
> [!TIP]
|
||||
> Find models pre-quantized with HIGGS in the official ISTA-DASLab [collection](https://huggingface.co/collections/ISTA-DASLab/higgs-675308e432fd56b7f6dab94e).
|
||||
|
||||
Some pre-quantized models can be found in the [official collection](https://huggingface.co/collections/ISTA-DASLab/higgs-675308e432fd56b7f6dab94e) on Hugging Face Hub.
|
||||
## torch.compile
|
||||
|
||||
## Current Limitations
|
||||
HIGGS is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
|
||||
|
||||
**Architectures**
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
|
||||
|
||||
Currently, FLUTE, and HIGGS by extension, **only support Llama 3.0 and 3.1 with 8B, 70B, and 405B parameters, as well as Gemma-2 9B and 27B**. We're working on supporting more diverse models, as well as arbitrary models, by modifying the FLUTE compilation procedure.
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"google/gemma-2-9b-it",
|
||||
quantization_config=HiggsConfig(bits=4),
|
||||
device_map="auto",
|
||||
)
|
||||
|
||||
**torch.compile**
|
||||
model = torch.compile(model)
|
||||
```
|
||||
|
||||
HIGGS is fully compatible with `torch.compile`. Compiling `model.forward`, as described [here](../perf_torch_compile.md), provides the following speedups on an RTX 4090 for `Llama-3.1-8B-Instruct` (forward passes/sec):
|
||||
Refer to the table below for a benchmark of forward passes/sec for Llama-3.1-8B-Instruct on an RTX 4090.
|
||||
|
||||
| Batch Size | BF16 (With `torch.compile`) | HIGGS 4bit (No `torch.compile`) | HIGGS 4bit (With `torch.compile`) |
|
||||
| Batch Size | BF16 (with `torch.compile`) | HIGGS 4bit (without `torch.compile`) | HIGGS 4bit (with `torch.compile`) |
|
||||
|------------|-----------------------------|----------------------------------|-----------------------------------|
|
||||
| 1 | 59 | 41 | 124 |
|
||||
| 4 | 57 | 42 | 123 |
|
||||
| 16 | 56 | 41 | 120 |
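
The relative speedups implied by the table can be worked out with a few lines of arithmetic. The snippet below only re-uses the numbers from the table above; the dictionary keys are illustrative labels, not names from any library.

```python
# Forward passes/sec copied from the benchmark table above.
benchmark = {
    1: {"bf16_compiled": 59, "higgs": 41, "higgs_compiled": 124},
    4: {"bf16_compiled": 57, "higgs": 42, "higgs_compiled": 123},
    16: {"bf16_compiled": 56, "higgs": 41, "higgs_compiled": 120},
}

speedups = {}
for batch_size, row in benchmark.items():
    vs_bf16 = row["higgs_compiled"] / row["bf16_compiled"]
    vs_uncompiled = row["higgs_compiled"] / row["higgs"]
    speedups[batch_size] = (vs_bf16, vs_uncompiled)
    print(
        f"batch {batch_size:>2}: {vs_bf16:.2f}x vs. compiled BF16, "
        f"{vs_uncompiled:.2f}x vs. uncompiled HIGGS"
    )
```

At every batch size in the table, compiled 4-bit HIGGS is a little over twice as fast as compiled BF16, while uncompiled HIGGS is slower than compiled BF16.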
|
||||
|
||||
|
||||
**Quantized training**
|
||||
|
||||
Currently, HIGGS doesn't support quantized training (and backward passes in general). We're working on adding support for it.
|
||||
@@ -14,27 +14,43 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# HQQ
|
||||
|
||||
# HQQ
|
||||
[Half-Quadratic Quantization (HQQ)](https://github.com/mobiusml/hqq/) supports fast on-the-fly quantization to 8, 4, 3, 2, and even 1 bit. It doesn't require calibration data, and it is compatible with any model modality (LLMs, vision, etc.).
|
||||
|
||||
Half-Quadratic Quantization (HQQ) implements on-the-fly quantization via fast robust optimization. It doesn't require calibration data and can be used to quantize any model.
|
||||
Please refer to the <a href="https://github.com/mobiusml/hqq/">official package</a> for more details.
|
||||
HQQ further supports fine-tuning with [PEFT](https://huggingface.co/docs/peft) and is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for even faster inference and training.
|
||||
|
||||
For installation, we recommend you use the following approach to get the latest version and build its corresponding CUDA kernels:
|
||||
```
|
||||
Install HQQ with the following command to get the latest version and to build its corresponding CUDA kernels.
|
||||
|
||||
```bash
|
||||
pip install hqq
|
||||
```
|
||||
|
||||
To quantize a model, you need to create an [`HqqConfig`]. There are two ways of doing it:
|
||||
``` Python
|
||||
You can choose to either replace all the linear layers in a model with the same quantization config or dedicate a specific quantization config for specific linear layers.
|
||||
|
||||
<hfoptions id="hqq">
|
||||
<hfoption id="replace all layers">
|
||||
|
||||
Quantize a model by creating a [`HqqConfig`] and specifying the `nbits` and `group_size` to replace for all the linear layers ([torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)) of the model.
|
||||
|
||||
```py
|
||||
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
|
||||
|
||||
# Method 1: all linear layers will use the same quantization config
|
||||
quant_config = HqqConfig(nbits=8, group_size=64)
|
||||
quant_config = HqqConfig(nbits=8, group_size=64)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
torch_dtype=torch.float16,
|
||||
device_map="cuda",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
```
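
To build intuition for what `nbits` and `group_size` control, the sketch below quantizes a short list of weights group by group, with one scale and offset per group. It uses plain round-to-nearest in pure Python and is not HQQ's optimization procedure; `quantize_groupwise` and `mean_abs_error` are hypothetical helpers.

```python
def quantize_groupwise(values, nbits, group_size):
    """Round-to-nearest quantization with one scale and offset per group."""
    levels = 2 ** nbits - 1
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
        restored.extend(round((v - lo) / scale) * scale + lo for v in group)
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# The single outlier (6.0) stretches the range of whichever group contains it.
weights = [0.10, -0.20, 0.06, 0.30, -0.15, 0.25, 6.00, -0.10]

error_one_group = mean_abs_error(weights, quantize_groupwise(weights, nbits=4, group_size=8))
error_two_groups = mean_abs_error(weights, quantize_groupwise(weights, nbits=4, group_size=4))
print(f"group_size=8: {error_one_group:.4f}")
print(f"group_size=4: {error_two_groups:.4f}")
```

With smaller groups, the outlier only degrades the group it belongs to, so the average error drops. The trade-off is that every group stores its own scale and offset, which adds memory overhead.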
|
||||
|
||||
``` Python
|
||||
# Method 2: each linear layer with the same tag will use a dedicated quantization config
|
||||
</hfoption>
|
||||
<hfoption id="specific layers only">
|
||||
|
||||
Quantize a model by creating a dictionary specifying the `nbits` and `group_size` for the linear layers to quantize. Pass them to [`HqqConfig`] and set which layers to quantize with the config. This approach is especially useful for quantizing mixture-of-experts (MoEs) because the experts are less affected by lower quantization settings.
|
||||
|
||||
```py
|
||||
q4_config = {'nbits':4, 'group_size':64}
|
||||
q3_config = {'nbits':3, 'group_size':32}
|
||||
quant_config = HqqConfig(dynamic_config={
|
||||
@@ -47,23 +63,38 @@ quant_config = HqqConfig(dynamic_config={
|
||||
'mlp.up_proj' :q3_config,
|
||||
'mlp.down_proj':q3_config,
|
||||
})
|
||||
```
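
The idea of mapping layer tags to dedicated settings can be sketched without any library. The lookup rule below (the tag is the last two parts of the dotted module name) is an assumption made for illustration and may differ from how HQQ actually resolves tags. `layer_tag` and `settings_for` are hypothetical helpers, and the `self_attn.q_proj` entry is an illustrative tag.

```python
q4_config = {"nbits": 4, "group_size": 64}
q3_config = {"nbits": 3, "group_size": 32}

# Tag -> settings, mirroring the shape of `dynamic_config`.
# "self_attn.q_proj" is an illustrative entry.
dynamic_config = {
    "self_attn.q_proj": q4_config,
    "mlp.up_proj": q3_config,
    "mlp.down_proj": q3_config,
}


def layer_tag(module_name):
    """Assumed rule: the tag is the last two parts of the dotted module name."""
    return ".".join(module_name.split(".")[-2:])


def settings_for(module_name):
    """Return the settings for a layer, or None if its tag is not listed."""
    return dynamic_config.get(layer_tag(module_name))


for name in [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.7.mlp.down_proj",
    "lm_head",
]:
    print(name, "->", settings_for(name))
```

Every layer that shares a tag receives the same settings regardless of its depth in the model, and layers whose tag is not listed are left without a dedicated config in this sketch.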
|
||||
|
||||
The second approach is especially interesting for quantizing Mixture-of-Experts (MoEs) because the experts are less affected by lower quantization settings.
|
||||
|
||||
|
||||
Then you simply quantize the model as follows
|
||||
``` Python
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
torch_dtype=torch.float16,
|
||||
device_map="cuda",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
```
|
||||
|
||||
## Optimized Runtime
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
HQQ supports various backends, including pure PyTorch and custom dequantization CUDA kernels. These backends are suitable for older gpus and peft/QLoRA training.
|
||||
For faster inference, HQQ supports 4-bit fused kernels (TorchAO and Marlin), reaching up to 200 tokens/sec on a single 4090.
|
||||
For more details on how to use the backends, please refer to https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend
|
||||
## Backends
|
||||
|
||||
HQQ supports various backends, including pure PyTorch and custom dequantization CUDA kernels. These backends are suitable for older GPUs and PEFT/QLoRA training.
|
||||
|
||||
```py
|
||||
from hqq.core.quantize import *
|
||||
|
||||
HQQLinear.set_backend(HQQBackend.PYTORCH)
|
||||
```
|
||||
|
||||
For faster inference, HQQ supports 4-bit fused kernels (torchao and Marlin) after a model is quantized. These can reach up to 200 tokens/sec on a single 4090. The example below demonstrates enabling the torchao_int4 backend.
|
||||
|
||||
```py
|
||||
from hqq.utils.patching import prepare_for_inference
|
||||
|
||||
# `model` is the already-quantized model object, not a string
prepare_for_inference(model, backend="torchao_int4")
|
||||
```
|
||||
|
||||
Refer to the [Backend](https://github.com/mobiusml/hqq/#backend) guide for more details.
|
||||
|
||||
## Resources
|
||||
|
||||
Read the [Half-Quadratic Quantization of Large Machine Learning Models](https://mobiusml.github.io/hqq_blog/) blog post for more details about HQQ.
|
||||
|
||||
@@ -16,4 +16,4 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# Optimum
|
||||
|
||||
The [Optimum](https://huggingface.co/docs/optimum/index) library supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. Consider using Optimum for quantization if you're using specific and optimized hardware like Intel CPUs, Furiosa NPUs or a model accelerator like ONNX Runtime.
|
||||
[Optimum](https://huggingface.co/docs/optimum/index) is an optimization library that supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. It is designed to enhance performance for specific hardware (Intel CPUs/HPUs, AMD GPUs, Furiosa NPUs, etc.) and model accelerators like ONNX Runtime.
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
@@ -14,82 +14,36 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Quantization
|
||||
# Overview
|
||||
|
||||
Quantization techniques focus on representing data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.
|
||||
Quantization lowers the memory requirements of loading and using a model by storing the weights in a lower precision while trying to preserve as much accuracy as possible. Weights are typically stored in full-precision (fp32) floating point representations, but half-precision (fp16 or bf16) are increasingly popular data types given the large size of models today. Some quantization methods can reduce the precision even further to integer representations, like int8 or int4.
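
The effect of precision on memory can be estimated with simple arithmetic: the number of parameters times the bits per weight. The snippet below is a back-of-the-envelope sketch for the weights alone, using an illustrative 8B-parameter model. It ignores quantization metadata (scales, zero points), activations, and the KV cache, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    return num_parameters * bits_per_weight / 8 / 1024**3


num_parameters = 8_000_000_000  # illustrative 8B-parameter model
for dtype, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{dtype:>9}: {weight_memory_gib(num_parameters, bits):5.1f} GiB")
```

Each halving of the bit width halves the weight memory, which is why int4 weights need roughly one eighth of the memory of fp32 weights.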
|
||||
|
||||
<Tip>
|
||||
Transformers supports many quantization methods, each with their pros and cons, so you can pick the best one for your specific use case. Some methods require calibration for greater accuracy and extreme compression (1-2 bits), while other methods work out of the box with on-the-fly quantization.
|
||||
|
||||
Interested in adding a new quantization method to Transformers? Read the [HfQuantizer](./contribute) guide to learn how!
|
||||
|
||||
</Tip>
|
||||
|
||||
<Tip>
|
||||
|
||||
If you are new to the quantization field, we recommend you to check out these beginner-friendly courses about quantization in collaboration with DeepLearning.AI:
|
||||
|
||||
* [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/)
|
||||
* [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth/)
|
||||
|
||||
</Tip>
|
||||
|
||||
## When to use what?
|
||||
|
||||
The community has developed many quantization methods for various use cases. With Transformers, you can run any of these integrated methods depending on your use case because each method has their own pros and cons.
|
||||
|
||||
For example, some quantization methods require calibrating the model with a dataset for more accurate and "extreme" compression (up to 1-2 bits quantization), while other methods work out of the box with on-the-fly quantization.
|
||||
|
||||
Another parameter to consider is compatibility with your target device. Do you want to quantize on a CPU, GPU, or Apple silicon?
|
||||
|
||||
In short, supporting a wide range of quantization methods allows you to pick the best quantization method for your specific use case.
|
||||
|
||||
Use the table below to help you decide which quantization method to use.
|
||||
Use the Space below to help you pick a quantization method depending on your hardware and number of bits to quantize to.
|
||||
|
||||
| Quantization Method | On the fly quantization | CPU | CUDA GPU | ROCm GPU | Metal (Apple Silicon) | Intel GPU | Torch compile() | Bits | PEFT Fine Tuning | Serializable with 🤗Transformers | 🤗Transformers Support | Link to library |
|
||||
|-----------------------------------------------|----------------------|-----------------|----------|-----------|------------------------------------|-----------------|-----------------|---------------|------------------|-----------------------------|-------------------------|---------------------------------------------|
|
||||
| [AQLM](./aqlm.md) | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 1/2 | 🟢 | 🟢 | 🟢 | https://github.com/Vahe1994/AQLM |
|
||||
| [AWQ](./awq.md) | 🔴 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | ? | 4 | 🟢 | 🟢 | 🟢 | https://github.com/casper-hansen/AutoAWQ |
|
||||
| [bitsandbytes](./bitsandbytes.md) | 🟢 | 🟡 <sub>1</sub> | 🟢 | 🟡 <sub>1</sub> | 🔴 <sub>2</sub> | 🟡 <sub>1</sub> | 🔴 <sub>1</sub> | 4/8 | 🟢 | 🟢 | 🟢 | https://github.com/bitsandbytes-foundation/bitsandbytes |
|
||||
| [bitsandbytes](./bitsandbytes.md) | 🟢 | 🟡 | 🟢 | 🟡 | 🔴 | 🟡 | 🔴 | 4/8 | 🟢 | 🟢 | 🟢 | https://github.com/bitsandbytes-foundation/bitsandbytes |
|
||||
| [compressed-tensors](./compressed_tensors.md) | 🔴 | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 1/8 | 🟢 | 🟢 | 🟢 | https://github.com/neuralmagic/compressed-tensors |
|
||||
| [EETQ](./eetq.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | ? | 8 | 🟢 | 🟢 | 🟢 | https://github.com/NetEase-FuXi/EETQ |
|
||||
| [GGUF / GGML (llama.cpp)](../gguf.md) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 1/8 | 🔴 | [See Notes](../gguf.md) | [See Notes](../gguf.md) | https://github.com/ggerganov/llama.cpp |
|
||||
| [GPTQModel](./gptq.md) | 🔴 | 🟢 <sub>3</sub> | 🟢 | 🟢 | 🟢 | 🟢 <sub>4</sub> | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/ModelCloud/GPTQModel |
|
||||
| [GPTQModel](./gptq.md) | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/ModelCloud/GPTQModel |
|
||||
| [AutoGPTQ](./gptq.md) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ |
|
||||
| [HIGGS](./higgs.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 2/4 | 🔴 | 🟢 | 🟢 | https://github.com/HanGuo97/flute |
|
||||
| [HQQ](./hqq.md) | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 1/8 | 🟢 | 🔴 | 🟢 | https://github.com/mobiusml/hqq/ |
|
||||
| [optimum-quanto](./quanto.md) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 2/4/8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/optimum-quanto |
|
||||
| [FBGEMM_FP8](./fbgemm_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
|
||||
| [torchao](./torchao.md) | 🟢 | 🟢 | 🟢 | 🔴 | 🟡 <sub>5</sub> | 🔴 | | 4/8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
|
||||
| [torchao](./torchao.md) | 🟢 | 🟢 | 🟢 | 🔴 | 🟡 | 🔴 | | 4/8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
|
||||
| [VPTQ](./vptq.md) | 🔴 | 🔴 | 🟢 | 🟡 | 🔴 | 🔴 | 🟢 | 1/8 | 🔴 | 🟢 | 🟢 | https://github.com/microsoft/VPTQ |
|
||||
| [SpQR](./spqr.md) | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 3 | 🔴 | 🟢 | 🟢 | https://github.com/Vahe1994/SpQR/ |
|
||||
| [FINEGRAINED_FP8](./finegrained_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | |
|
||||
<Tip>
|
||||
|
||||
**1:** bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
| [SpQR](./spqr.md) | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 3 | 🔴 | 🟢 | 🟢 | https://github.com/Vahe1994/SpQR/ |
|
||||
|
||||
</Tip>
|
||||
## Resources
|
||||
|
||||
<Tip>
|
||||
|
||||
**2:** bitsandbytes is seeking contributors to help develop and lead the Apple Silicon backend. Interested? Contact them directly via their repo. Stipends may be available through sponsorships.
|
||||
|
||||
</Tip>
|
||||
|
||||
<Tip>
|
||||
|
||||
**3:** GPTQModel[CPU] supports 4-bit via IPEX on Intel/AMD and full bit range via Torch on Intel/AMD/Apple Silicon.
|
||||
|
||||
</Tip>
|
||||
|
||||
<Tip>
|
||||
|
||||
**4:** GPTQModel[Intel GPU] via IPEX only supports 4-bit for Intel Datacenter Max/Arc GPUs.
|
||||
|
||||
</Tip>
|
||||
|
||||
<Tip>
|
||||
|
||||
**5:** torchao only supports int4 weight on Metal (Apple Silicon).
|
||||
|
||||
</Tip>
|
||||
If you are new to quantization, we recommend checking out these beginner-friendly quantization courses in collaboration with DeepLearning.AI.
|
||||
|
||||
* [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/)
|
||||
* [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth/)
|
||||
|
||||
@@ -14,55 +14,56 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Optimum-quanto
|
||||
# Optimum Quanto
|
||||
|
||||
<Tip>
|
||||
[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/index). It features linear quantization for weights (float8, int8, int4, int2) with accuracy very similar to full-precision models. Quanto is compatible with any model modality and device, making it simple to use regardless of hardware.
|
||||
|
||||
Try optimum-quanto + transformers with this [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing)!
|
||||
Quanto is also compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for faster generation.
|
||||
|
||||
</Tip>
|
||||
|
||||
|
||||
[🤗 optimum-quanto](https://github.com/huggingface/optimum-quanto) library is a versatile pytorch quantization toolkit. The quantization method used is the linear quantization. Quanto provides several unique features such as:
|
||||
|
||||
- weights quantization (`float8`,`int8`,`int4`,`int2`)
|
||||
- activation quantization (`float8`,`int8`)
|
||||
- modality agnostic (e.g CV,LLM)
|
||||
- device agnostic (e.g CUDA,XPU,MPS,CPU)
|
||||
- compatibility with `torch.compile`
|
||||
- easy to add custom kernel for specific device
|
||||
- supports quantization aware training
|
||||
<!-- Add link to the blogpost -->
|
||||
|
||||
Before you begin, make sure the following libraries are installed:
|
||||
Install Quanto with the following command.
|
||||
|
||||
```bash
|
||||
pip install optimum-quanto accelerate transformers
|
||||
```
|
||||
|
||||
Now you can quantize a model by passing [`QuantoConfig`] object in the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it contains `torch.nn.Linear` layers.
|
||||
Quantize a model by creating a [`QuantoConfig`] and specifying the data type to quantize the weights to with the `weights` parameter. This works for any model in any modality as long as it contains [torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers.
|
||||
|
||||
The integration with transformers only supports weights quantization. For the more complex use case such as activation quantization, calibration and quantization aware training, you should use [optimum-quanto](https://github.com/huggingface/optimum-quanto) library instead.
|
||||
|
||||
By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file to automatically load the most memory-optimal data type.
|
||||
> [!TIP]
|
||||
> The Transformers integration only supports weight quantization. Use the Quanto library directly if you need activation quantization, calibration, or QAT.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
|
||||
|
||||
model_id = "facebook/opt-125m"
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
quantization_config = QuantoConfig(weights="int8")
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0", quantization_config=quantization_config)
|
||||
quant_config = QuantoConfig(weights="int8")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
torch_dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
```
|
||||
|
||||
Note that serialization is not supported yet with transformers but it is coming soon! If you want to save the model, you can use quanto library instead.
|
||||
## torch.compile
|
||||
|
||||
Optimum-quanto library uses linear quantization algorithm for quantization. Even though this is a basic quantization technique, we get very good results! Have a look at the following benchmark (llama-2-7b on perplexity metric). You can find more benchmarks [here](https://github.com/huggingface/optimum-quanto/tree/main/bench/generation)
|
||||
Wrap a Quanto model with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for faster generation.
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/NousResearch-Llama-2-7b-hf_Perplexity.png" alt="llama-2-7b-quanto-perplexity" />
|
||||
</div>
|
||||
</div>
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForSpeechSeq2Seq, QuantoConfig
|
||||
|
||||
The library is versatile enough to be compatible with most PTQ optimization algorithms. The plan in the future is to integrate the most popular algorithms in the most seamless possible way (AWQ, Smoothquant).
|
||||
quant_config = QuantoConfig(weights="int8")
|
||||
model = AutoModelForSpeechSeq2Seq.from_pretrained(
|
||||
"openai/whisper-large-v2",
|
||||
torch_dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
|
||||
model = torch.compile(model)
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
Read the [Quanto: a PyTorch quantization backend for Optimum](https://huggingface.co/blog/quanto-introduction) blog post to learn more about the library design and benchmarks.
|
||||
|
||||
For more hands-on examples, take a look at the Quanto [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing).
|
||||
@@ -16,11 +16,16 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# SpQR
|
||||
|
||||
[SpQR](https://github.com/Vahe1994/SpQR) quantization algorithm involves a 16x16 tiled bi-level group 3-bit quantization structure, with sparse outliers as detailed in [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078).
|
||||
The [SpQR](https://hf.co/papers/2306.03078) quantization algorithm involves a 16x16 tiled bi-level group 3-bit quantization structure with sparse outliers.
|
||||
|
||||
To SpQR-quantize a model, refer to the [Vahe1994/SpQR](https://github.com/Vahe1994/SpQR) repository.
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/spqr-diagram.png">
|
||||
</div>
|
||||
|
||||
Load a pre-SpQR-quantized model in [`~PreTrainedModel.from_pretrained`].
|
||||
> [!TIP]
|
||||
> To quantize a model with SpQR, refer to the [Vahe1994/SpQR](https://github.com/Vahe1994/SpQR) repository.
|
||||
|
||||
Load a SpQR-quantized model with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
@@ -9,46 +9,53 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
-->
|
||||
|
||||
# TorchAO
|
||||
# torchao
|
||||
|
||||
[TorchAO](https://github.com/pytorch/ao) is an architecture optimization library for PyTorch. It provides high-performance dtypes, optimization techniques, and kernels for inference and training, and it composes with native PyTorch features like `torch.compile` and FSDP. Some benchmark numbers can be found [here](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks).
|
||||
[torchao](https://github.com/pytorch/ao) is a PyTorch architecture optimization library with support for custom high performance data types, quantization, and sparsity. It is composable with native PyTorch features such as [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for even faster inference and training.
|
||||
|
||||
Before you begin, make sure the following libraries are installed with their latest version:
|
||||
Install torchao with the following command.
|
||||
|
||||
```bash
|
||||
# Updating 🤗 Transformers to the latest version, as the example script below uses the new auto compilation
|
||||
pip install --upgrade torch torchao transformers
|
||||
```
|
||||
|
||||
By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file to automatically load the most memory-optimal data type.
|
||||
torchao supports many quantization types for different data types (int4, float8, weight only, etc.), but the Transformers integration currently supports only `int4_weight_only`, `int8_weight_only`, and `int8_dynamic_activation_int8_weight`.
|
||||
|
||||
You can manually choose the quantization types and settings or automatically select the quantization types.
|
||||
|
||||
## Manually Choose Quantization Types and Settings
|
||||
<hfoptions id="torchao">
|
||||
<hfoption id="manual">
|
||||
|
||||
`torchao` provides many commonly used types of quantization, including different dtypes like int4 and float8 and different flavors like weight-only and dynamic quantization. Only `int4_weight_only`, `int8_weight_only`, and `int8_dynamic_activation_int8_weight` are currently integrated into Hugging Face Transformers, but more can be added when needed.
|
||||
If you want to run the following codes on CPU even with GPU available, just change `device_map="cpu"` and `quantization_config = TorchAoConfig("int4_weight_only", group_size=128, layout=Int4CPULayout())` where `layout` comes from `from torchao.dtypes import Int4CPULayout` which is only available from torchao 0.8.0 and higher.
|
||||
Create a [`TorchAoConfig`] and specify the quantization type and `group_size` of the weights to quantize. Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method.
|
||||
|
||||
Users can manually specify the quantization types and settings they want to use:
|
||||
> [!TIP]
|
||||
> Run the quantized model on a CPU by changing `device_map` to `"cpu"` and `layout` to `Int4CPULayout()`. This is only available in torchao 0.8.0+.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model_name = "meta-llama/Meta-Llama-3-8B"
|
||||
# We support int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight
|
||||
# More examples and documentations for arguments can be found in https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
|
||||
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Meta-Llama-3-8B",
|
||||
torch_dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speedup
|
||||
# auto-compile the quantized model with `cache_implementation="static"` for a speedup
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
# benchmark the performance
|
||||
Run the code below to benchmark the quantized model's performance.
|
||||
|
||||
```py
|
||||
from torch._inductor.utils import do_bench_using_profiling
|
||||
from typing import Callable
|
||||
|
||||
@@ -64,32 +71,44 @@ print("int4wo-128 model:", benchmark_fn(quantized_model.generate, **input_ids, m
|
||||
bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
|
||||
output = bf16_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") # auto-compile
|
||||
print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))
|
||||
|
||||
```
|
||||
|
||||
## Automatically Select Quantization Types
|
||||
</hfoption>
|
||||
<hfoption id="automatic">
|
||||
|
||||
`torchao` also provides an `autoquant` feature that automatically chooses a quantization type for quantizable layers, such as linear layers, based on microbenchmarks of quantizing and compiling a single linear layer.
|
||||
The [autoquant](https://pytorch.org/ao/stable/generated/torchao.quantization.autoquant.html#torchao.quantization.autoquant) API automatically chooses a quantization type for quantizable layers (`nn.Linear`) by micro-benchmarking on input type and shape and compiling a single linear layer.
|
||||
|
||||
Create a [`TorchAoConfig`] and set the quantization type to `"autoquant"`. Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method. Finally, call `finalize_autoquant` on the quantized model to finalize the quantization and log the input shapes.
|
||||
|
||||
> [!TIP]
|
||||
> Run the quantized model on a CPU by changing `device_map` to `"cpu"` and `layout` to `Int4CPULayout()`. This is only available in torchao 0.8.0+.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model_name = "meta-llama/Meta-Llama-3-8B"
|
||||
quantization_config = TorchAoConfig("autoquant", min_sqnr=None)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Meta-Llama-3-8B",
|
||||
torch_dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speedup
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
# Due to some implementation details we are explicitly calling this now, we may refactor our code and remove this in the future
|
||||
# explicitly call `finalize_autoquant` (may be refactored and removed in the future)
|
||||
quantized_model.finalize_autoquant()
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
# benchmark the performance
|
||||
Run the code below to benchmark the quantized model's performance.
|
||||
|
||||
```py
|
||||
from torch._inductor.utils import do_bench_using_profiling
|
||||
from typing import Callable
|
||||
|
||||
@@ -102,32 +121,28 @@ def benchmark_fn(func: Callable, *args, **kwargs) -> float:
|
||||
MAX_NEW_TOKENS = 1000
|
||||
print("autoquantized model:", benchmark_fn(quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))
|
||||
|
||||
bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", torch_dtype=torch.bfloat16)
|
||||
bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
|
||||
output = bf16_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") # auto-compile
|
||||
print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))
|
||||
|
||||
```
|
||||
|
||||
## Serialization and Deserialization
|
||||
torchao quantization is implemented with [tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor), so it only works with Hugging Face non-safetensors serialization and deserialization. It relies on `torch.load(..., weights_only=True)` to avoid arbitrary user code execution at load time and uses [add_safe_globals](https://pytorch.org/docs/stable/notes/serialization.html#torch.serialization.add_safe_globals) to allowlist some known user functions.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
The reason why it does not support safe tensor serialization is that wrapper tensor subclass allows maximum flexibility so we want to make sure the effort of supporting new format of quantized Tensor is low, while safe tensor optimizes for maximum safety (no user code execution), it also means we have to make sure to manually support new quantization format.
|
||||
## Serialization
|
||||
|
||||
torchao implements [torch.Tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) for maximum flexibility in supporting new quantized torch.Tensor formats. [Safetensors](https://huggingface.co/docs/safetensors/en/index) serialization and deserialization do not work with torchao.
|
||||
|
||||
To avoid arbitrary user code execution, torchao sets `weights_only=True` in [torch.load](https://pytorch.org/docs/stable/generated/torch.load.html) to ensure only tensors are loaded. Any known user functions can be whitelisted with [add_safe_globals](https://pytorch.org/docs/stable/notes/serialization.html#torch.serialization.add_safe_globals).
|
||||
|
||||
```py
|
||||
# save quantized model locally
|
||||
# don't serialize model with Safetensors
|
||||
output_dir = "llama3-8b-int4wo-128"
|
||||
quantized_model.save_pretrained(output_dir, safe_serialization=False)
|
||||
|
||||
# push to huggingface hub
|
||||
# save_to = "{user_id}/llama3-8b-int4wo-128"
|
||||
# quantized_model.push_to_hub(save_to, safe_serialization=False)
|
||||
|
||||
# load quantized model
|
||||
ckpt_id = "llama3-8b-int4wo-128" # or huggingface hub model id
|
||||
loaded_quantized_model = AutoModelForCausalLM.from_pretrained(ckpt_id, device_map="auto")
|
||||
|
||||
|
||||
# confirm the speedup
|
||||
loaded_quantized_model = torch.compile(loaded_quantized_model, mode="max-autotune")
|
||||
print("loaded int4wo-128 model:", benchmark_fn(loaded_quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS))
|
||||
quantized_model.save_pretrained("llama3-8b-int4wo-128", safe_serialization=False)
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
For a better sense of expected performance, view the [benchmarks](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks) for various models with CUDA and XPU backends.
|
||||
|
||||
Refer to [Other Available Quantization Techniques](https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques) for more examples and documentation.
|
||||
|
||||
@@ -14,34 +14,33 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# VPTQ
|
||||
# VPTQ
|
||||
|
||||
> [!TIP]
|
||||
> Try VPTQ on [Hugging Face](https://huggingface.co/spaces/microsoft/VPTQ)!
|
||||
> Try VPTQ on [Google Colab](https://colab.research.google.com/github/microsoft/VPTQ/blob/main/notebooks/vptq_example.ipynb)!
|
||||
> Know more about VPTQ on [ArXiv](https://arxiv.org/pdf/2409.17066)!
|
||||
[Vector Post-Training Quantization (VPTQ)](https://github.com/microsoft/VPTQ) is a Post-Training Quantization (PTQ) method that leverages vector quantization to quantize LLMs at an extremely low bit-width (<2-bit). VPTQ can compress a 70B, even a 405B model, to 1-2 bits without retraining and still maintain a high degree of accuracy. It is a lightweight quantization algorithm that takes ~17 hours to quantize a 405B model. VPTQ features agile quantization inference with low decoding overhead, high throughput, and a low Time To First Token (TTFT).
|
||||
|
||||
Vector Post-Training Quantization ([VPTQ](https://github.com/microsoft/VPTQ)) is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and maintain high accuracy.
|
||||
Run the command below to install VPTQ which provides efficient kernels for inference on NVIDIA and AMD GPUs.
|
||||
|
||||
- Better Accuracy on 1-2 bits, (405B @ <2bit, 70B @ 2bit)
|
||||
- Lightweight Quantization Algorithm: only cost ~17 hours to quantize 405B Llama-3.1
|
||||
- Agile Quantization Inference: low decode overhead, best throughput, and TTFT
|
||||
|
||||
Inference support for VPTQ is released in the `vptq` library. Make sure to install it to run the models:
|
||||
```bash
|
||||
pip install vptq
|
||||
```
|
||||
|
||||
The library provides efficient kernels for NVIDIA/AMD GPU inference.
|
||||
The [VPTQ-community](https://huggingface.co/VPTQ-community) provides a collection of VPTQ-quantized models. The model name contains information about its bitwidth (excluding codebook, parameter, and padding overhead). Consider the [Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft) model as an example.
|
||||
|
||||
To run VPTQ models simply load a model that has been quantized with VPTQ:
|
||||
- The model name is Meta-Llama-3.1-70B-Instruct.
|
||||
- The number of centroids is given by 65536 (2^16).
|
||||
- The number of residual centroids is given by 256 (2^8).
|
||||
|
||||
## Inference example
|
||||
**Run Llama 3.1 70b on RTX4090 (24G @ ~2bits) in real time**
|
||||

|
||||
The equivalent bit-width calculation is given by the following.
|
||||
|
||||
- index: log2(65536) / 8 = 16 / 8 = 2 bits per weight (8 is the vector length, `v8`)
|
||||
- residual index: log2(256) / 8 = 8 / 8 = 1 bit per weight
|
||||
- total bit-width: 2 + 1 = 3 bits per weight
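
The naming convention and bit-width arithmetic above can be sketched in a few lines of Python. This is only an illustrative estimate, not part of the `vptq` library: the helper name `estimate_bitwidth` and the regular expression are assumptions based on the `v<vector length>-k<centroids>-<residual centroids>` pattern seen in the VPTQ-community model names, and a residual count of `0` is assumed to mean no residual codebook.

```python
import math
import re


def estimate_bitwidth(model_name: str) -> float:
    """Estimate bits per weight from a VPTQ-community style model name.

    Hypothetical helper: assumes the name contains
    `-v<vector length>-k<centroids>-<residual centroids>`.
    Codebook, parameter, and padding overhead are ignored.
    """
    match = re.search(r"-v(\d+)-k(\d+)-(\d+)", model_name)
    if match is None:
        raise ValueError(f"Unrecognized VPTQ model name: {model_name}")
    vector_length, centroids, residual_centroids = map(int, match.groups())

    # Each vector of `vector_length` weights stores one index per codebook.
    index_bits = math.log2(centroids)
    # Assumption: 0 residual centroids means there is no residual codebook.
    residual_bits = math.log2(residual_centroids) if residual_centroids > 0 else 0.0
    return (index_bits + residual_bits) / vector_length


name = "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft"
bits = estimate_bitwidth(name)

# Rough model size: parameters * bits per weight / 8 bits per byte, in GB.
num_params = 70e9
size_gb = num_params * bits / 8 / 1e9
print(f"{bits:.2f} bits per weight, ~{size_gb:.2f} GB")
```

For the example model this gives 3 bits per weight and roughly 26.25 GB for 70B parameters. The real checkpoint is larger because the codebooks and padding are not counted.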
|
||||
|
||||
```python
|
||||
From here, estimate the model size by multiplying 70B * 3-bits / 8-bits/byte for a total of 26.25GB.
|
||||
|
||||
Load a VPTQ quantized model with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```py
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
@@ -49,19 +48,14 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
torch_dtype="auto",
|
||||
device_map="auto"
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft")
|
||||
input_ids = tokenizer("hello, it's me", return_tensors="pt").to("cuda")
|
||||
out = quantized_model.generate(**input_ids, max_new_tokens=32, do_sample=False)
|
||||
```
|
||||
|
||||
## Quantize your own model
|
||||
VPTQ algorithm early-released at [VPTQ ](https://github.com/microsoft/VPTQ/tree/algorithm),
|
||||
and checkout the [tutorial](https://github.com/microsoft/VPTQ/blob/algorithm/algorithm.md).
|
||||
To quantize your own model, refer to the [VPTQ Quantization Algorithm Tutorial](https://github.com/microsoft/VPTQ/blob/algorithm/algorithm.md).
|
||||
|
||||
## Benchmarks
|
||||
|
||||
## Early Results from Tech Report
|
||||
VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed.
|
||||
|
||||
|
||||
| Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem(GB) | cost/h↓ |
|
||||
| ----------- | -------- | ---- | ---- | ------ | ------ | ------- | ------- |
|
||||
| LLaMA-2 7B | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28 | 2 |
|
||||
@@ -71,41 +65,8 @@ VPTQ achieves better accuracy and higher throughput with lower quantization over
|
||||
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
|
||||
| | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |
|
||||
|
||||
## Resources
|
||||
|
||||
See an example demo of VPTQ on the VPTQ Online Demo [Space](https://huggingface.co/spaces/microsoft/VPTQ) or try running the VPTQ inference [notebook](https://colab.research.google.com/github/microsoft/VPTQ/blob/main/notebooks/vptq_example.ipynb).
|
||||
|
||||
## More Models in [VPTQ-community](https://huggingface.co/VPTQ-community)
|
||||
|
||||
⚠️ The repository only provides a method of model quantization algorithm.
|
||||
|
||||
⚠️ The open-source community VPTQ-community provides models based on the technical report and quantization algorithm.
|
||||
|
||||
|
||||
|
||||
**Quick Estimation of Model Bitwidth (Excluding Codebook Overhead)**:
|
||||
|
||||
- **Model Naming Convention**: The model's name includes the **vector length** $v$, **codebook (lookup table) size**, and **residual codebook size**. For example, "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" is "Meta-Llama-3.1-70B-Instruct", where:
|
||||
- **Vector Length**: 8
|
||||
- **Number of Centroids**: 65536 (2^16)
|
||||
- **Number of Residual Centroids**: 256 (2^8)
|
||||
- **Equivalent Bitwidth Calculation**:
|
||||
- **Index**: log2(65536) = 16 / 8 = 2 bits
|
||||
- **Residual Index**: log2(256) = 8 / 8 = 1 bit
|
||||
- **Total Bitwidth**: 2 + 1 = 3 bits
|
||||
- **Model Size Estimation**: 70B * 3 bits / 8 bits per Byte = 26.25 GB
|
||||
|
||||
- **Note**: This estimate does not include the size of the codebook (lookup table), other parameter overheads, and the padding overhead for storing indices. For the detailed calculation method, please refer to **Tech Report Appendix C.2**.
|
||||
|
||||
|
||||
| Model Series | Collections | (Estimated) Bit per weight |
|
||||
| :--------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| Llama 3.1 Nemotron 70B Instruct HF | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-nemotron-70b-instruct-hf-without-finetune-671730b96f16208d0b3fe942) | [4 bits](https://huggingface.co/VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-65536-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v8-k65536-0-woft) [1.875 bits](https://huggingface.co/VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-16384-woft) [1.625 bits](https://huggingface.co/VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-1024-woft) [1.5 bits](https://huggingface.co/VPTQ-community/Llama-3.1-Nemotron-70B-Instruct-HF-v16-k65536-256-woft) |
|
||||
| Llama 3.1 8B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-8b-instruct-without-finetune-66f2b70b1d002ceedef02d2e) | [4 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-65536-woft) [3.5 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-4096-woft) [3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft) [2.3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft) |
|
||||
| Llama 3.1 70B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-70b-instruct-without-finetune-66f2bf454d3dd78dfee2ff11) | [4 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft) [2.25 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-4-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft) [1.93 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-32768-woft) [1.875 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft) [1.75 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k16384-0-woft) |
|
||||
| Llama 3.1 405B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-405b-instruct-without-finetune-66f4413f9ba55e1a9e52cfb0) | [4 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v8-k65536-256-woft) [2 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-65536-woft) [1.875 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k32768-32768-woft) [1.625 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-1024-woft) [1.5 bits (1)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v8-k4096-0-woft) [1.5 bits (2)](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-256-woft) [1.43 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-128-woft) [1.375 bits](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-64-woft) |
|
||||
| Mistral Large Instruct 2407 (123B) | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-mistral-large-instruct-2407-without-finetune-6711ebfb7faf85eed9cceb16) | [4 bits](https://huggingface.co/VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Mistral-Large-Instruct-2407-v16-k65536-65536-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-0-woft) [1.875 bits](https://huggingface.co/VPTQ-community/Mistral-Large-Instruct-2407-v16-k65536-16384-woft) [1.75 bits](https://huggingface.co/VPTQ-community/Mistral-Large-Instruct-2407-v16-k65536-4096-woft) [1.625 bits](https://huggingface.co/VPTQ-community/Mistral-Large-Instruct-2407-v16-k65536-1024-woft) [1.5 bits](https://huggingface.co/VPTQ-community/Mistral-Large-Instruct-2407-v16-k65536-256-woft) |
|
||||
| Qwen 2.5 7B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-7b-instruct-without-finetune-66f3e9866d3167cc05ce954a) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k256-256-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft) [2 bits (3)](https://huggingface.co/VPTQ-community/Qwen2.5-7B-Instruct-v16-k65536-65536-woft) |
|
||||
| Qwen 2.5 14B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-14b-instruct-without-finetune-66f827f83c7ffa7931b8376c) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k256-256-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v8-k65536-0-woft) [2 bits (3)](https://huggingface.co/VPTQ-community/Qwen2.5-14B-Instruct-v16-k65536-65536-woft) |
|
||||
| Qwen 2.5 32B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-32b-instruct-without-finetune-66fe77173bf7d64139f0f613) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-32B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-32B-Instruct-v8-k65536-256-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-32B-Instruct-v16-k65536-65536-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-32B-Instruct-v8-k65536-0-woft) [2 bits (3)](https://huggingface.co/VPTQ-community/Qwen2.5-32B-Instruct-v8-k256-256-woft) |
|
||||
| Qwen 2.5 72B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-72b-instruct-without-finetune-66f3bf1b3757dfa1ecb481c0) | [4 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-65536-woft) [3 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-256-woft) [2.38 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k1024-512-woft) [2.25 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k512-512-woft) [2.25 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-4-woft) [2 bits (1)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-0-woft) [2 bits (2)](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-65536-woft) [1.94 bits](https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-32768-woft) |
|
||||
| Reproduced from the tech report | [HF 🤗](https://huggingface.co/collections/VPTQ-community/reproduced-vptq-tech-report-baseline-66fbf1dffe741cc9e93ecf04) | Results from the open source community for reference only, please use them responsibly. |
|
||||
| Hessian and Inverse Hessian Matrix | [HF 🤗](https://huggingface.co/collections/VPTQ-community/hessian-and-invhessian-checkpoints-66fd249a104850d17b23fd8b) | Collected from RedPajama-Data-1T-Sample, following [Quip#](https://github.com/Cornell-RelaxML/quip-sharp/blob/main/quantize_llama/hessian_offline_llama.py)
|
||||
For more information, read the VPTQ [paper](https://arxiv.org/pdf/2409.17066).
|
||||
|
||||