FEAT / Bitsandbytes: Add dequantize API for bitsandbytes quantized models (#30806)
* add method * change method name * more comments * Apply suggestions from code review Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fixup * add docstrings and fix comment * warn users on the de-quantized dtype * Update src/transformers/quantizers/base.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * Update src/transformers/integrations/bitsandbytes.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * final suggestion - use private method --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
This commit is contained in:
@@ -642,6 +642,27 @@ double_quant_config = BitsAndBytesConfig(
|
||||
model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
|
||||
```
|
||||
|
||||
### Dequantizing `bitsandbytes` models
|
||||
|
||||
Once quantized, you can dequantize the model to the original precision. Note this might result in a small quality loss of the model. Make also sure to have enough GPU RAM to fit the dequantized model.
|
||||
Below is how to perform dequantization on a 4-bit model using `bitsandbytes`.
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
|
||||
|
||||
model_id = "facebook/opt-125m"
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(model_id, BitsAndBytesConfig(load_in_4bit=True))
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
|
||||
model.dequantize()
|
||||
|
||||
text = tokenizer("Hello my name is", return_tensors="pt").to(0)
|
||||
|
||||
out = model.generate(**text)
|
||||
print(tokenizer.decode(out[0]))
|
||||
```
|
||||
|
||||
## EETQ
|
||||
The [EETQ](https://github.com/NetEase-FuXi/EETQ) library supports int8 per-channel weight-only quantization for NVIDIA GPUS. The high-performance GEMM and GEMV kernels are from FasterTransformer and TensorRT-LLM. It requires no calibration dataset and does not need to pre-quantize your model. Moreover, the accuracy degradation is negligible owing to the per-channel quantization.
|
||||
|
||||
@@ -794,4 +815,4 @@ model = transformers.AutoModelForCausalLM.from_pretrained(
|
||||
### Optimized Runtime
|
||||
HQQ supports various backends, including pure Pytorch and custom dequantization CUDA kernels. These backends are suitable for older gpus and peft/QLoRA training.
|
||||
For faster inference, HQQ supports 4-bit fused kernels (TorchAO and Marlin), reaching up to 200 tokens/sec on a single 4090.
|
||||
For more details on how to use the backends, please refer to https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend
|
||||
For more details on how to use the backends, please refer to https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend
|
||||
|
||||
Reference in New Issue
Block a user