Add HQQ quantization support (#29637)
* update HQQ transformers integration * push import_utils.py * add force_hooks check in modeling_utils.py * fix | with Optional * force bias as param * check bias is Tensor * force forward for multi-gpu * review fixes pass * remove torch grad() * if any key in linear_tags fix * add cpu/disk check * isinstance return * add multigpu test + refactor tests * clean hqq_utils imports in hqq.py * clean hqq_utils imports in quantizer_hqq.py * delete hqq_utils.py * Delete src/transformers/utils/hqq_utils.py * ruff init * remove torch.float16 from __init__ in test * refactor test * isinstance -> type in quantizer_hqq.py * cpu/disk device_map check in quantizer_hqq.py * remove type(module) nn.linear check in quantizer_hqq.py * add BaseQuantizeConfig import inside HqqConfig init * remove hqq import in hqq.py * remove accelerate import from test_hqq.py * quant config.py doc update * add hqqconfig to main_classes doc * make style * __init__ fix * ruff __init__ * skip_modules list * hqqconfig format fix * hqqconfig doc fix * hqqconfig doc fix * hqqconfig doc fix * hqqconfig doc fix * hqqconfig doc fix * hqqconfig doc fix * hqqconfig doc fix * hqqconfig doc fix * hqqconfig doc fix * test_hqq.py remove mistral comment * remove self.using_multi_gpu is False * torch_dtype default val set and logger.info * hqq.py isinstance fix * remove torch=None * torch_device test_hqq * rename test_hqq * MODEL_ID in test_hqq * quantizer_hqq setattr fix * quantizer_hqq typo fix * imports quantizer_hqq.py * isinstance quantizer_hqq * hqq_layer.bias reformat quantizer_hqq * Step 2 as comment in quantizer_hqq * prepare_for_hqq_linear() comment * keep_in_fp32_modules fix * HqqHfQuantizer reformat * quantization.md hqqconfig * quantization.md model example reformat * quantization.md # space * quantization.md space }) * quantization.md space }) * quantization_config fix doc Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * axis value check in quantization_config * format * dynamic config explanation * quant config method in quantization.md * remove shard-level progress * .cuda fix modeling_utils * test_hqq fixes * make fix-copies --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
This commit is contained in:
4
docs/source/en/main_classes/quantization.md
Normal file → Executable file
4
docs/source/en/main_classes/quantization.md
Normal file → Executable file
@@ -52,3 +52,7 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
|
||||
## HfQuantizer
|
||||
|
||||
[[autodoc]] quantizers.base.HfQuantizer
|
||||
|
||||
## HqqConfig
|
||||
|
||||
[[autodoc]] HqqConfig
|
||||
|
||||
50
docs/source/en/quantization.md
Normal file → Executable file
50
docs/source/en/quantization.md
Normal file → Executable file
@@ -745,3 +745,53 @@ The speed and throughput of fused and unfused modules were also tested with the
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">generate throughput/batch size</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## HQQ
|
||||
Half-Quadratic Quantization (HQQ) implements on-the-fly quantization via fast robust optimization. It doesn't require calibration data and can be used to quantize any model.
|
||||
Please refer to the <a href="https://github.com/mobiusml/hqq/">official package</a> for more details.
|
||||
|
||||
For installation, we recommend you use the following approach to get the latest version and build its corresponding CUDA kernels:
|
||||
```
|
||||
pip install hqq
|
||||
```
|
||||
|
||||
To quantize a model, you need to create an [`HqqConfig`]. There are two ways of doing it:
|
||||
``` Python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
|
||||
|
||||
# Method 1: all linear layers will use the same quantization config
|
||||
quant_config = HqqConfig(nbits=8, group_size=64, quant_zero=False, quant_scale=False, axis=0) #axis=0 is used by default
|
||||
```
|
||||
|
||||
``` Python
|
||||
# Method 2: each linear layer with the same tag will use a dedicated quantization config
|
||||
q4_config = {'nbits':4, 'group_size':64, 'quant_zero':False, 'quant_scale':False}
|
||||
q3_config = {'nbits':3, 'group_size':32, 'quant_zero':False, 'quant_scale':False}
|
||||
quant_config = HqqConfig(dynamic_config={
|
||||
'self_attn.q_proj':q4_config,
|
||||
'self_attn.k_proj':q4_config,
|
||||
'self_attn.v_proj':q4_config,
|
||||
'self_attn.o_proj':q4_config,
|
||||
|
||||
'mlp.gate_proj':q3_config,
|
||||
'mlp.up_proj' :q3_config,
|
||||
'mlp.down_proj':q3_config,
|
||||
})
|
||||
```
|
||||
|
||||
The second approach is especially interesting for quantizing Mixture-of-Experts (MoEs) because the experts are less affected by lower quantization settings.
|
||||
|
||||
|
||||
Then you simply quantize the model as follows
|
||||
``` Python
|
||||
model = transformers.AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
torch_dtype=torch.float16,
|
||||
device_map="cuda",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
```
|
||||
### Optimized Runtime
|
||||
HQQ supports various backends, including pure Pytorch and custom dequantization CUDA kernels. These backends are suitable for older gpus and peft/QLoRA training.
|
||||
For faster inference, HQQ supports 4-bit fused kernels (TorchAO and Marlin), reaching up to 200 tokens/sec on a single 4090.
|
||||
For more details on how to use the backends, please refer to https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend
|
||||
0
docs/source/en/quicktour.md
Normal file → Executable file
0
docs/source/en/quicktour.md
Normal file → Executable file
Reference in New Issue
Block a user