[Quantization] Quanto quantizer (#29023)

* start integration

* fix

* add and debug tests

* update tests

* make pytorch serialization works

* compatible with device_map and offload

* fix tests

* make style

* add ref

* guard against safetensors

* add float8 and style

* fix is_serializable

* Fix shard_checkpoint compatibility with quanto

* more tests

* docs

* adjust memory

* better

* style

* pass tests

* Update src/transformers/modeling_utils.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* add is_safe_serialization instead

* Update src/transformers/quantizers/quantizer_quanto.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* add QbitsTensor tests

* fix tests

* simplify activation list

* Update docs/source/en/quantization.md

Co-authored-by: David Corvoysier <david.corvoysier@gmail.com>

* better comment

* Update tests/quantization/quanto_integration/test_quanto.py

Co-authored-by: David Corvoysier <david.corvoysier@gmail.com>

* Update tests/quantization/quanto_integration/test_quanto.py

Co-authored-by: David Corvoysier <david.corvoysier@gmail.com>

* find and fix edge case

* Update docs/source/en/quantization.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* pass weights_only_kwarg instead

* fix shard_checkpoint loading

* simplify update_missing_keys

* Update tests/quantization/quanto_integration/test_quanto.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* recursion to get all tensors

* block serialization

* skip serialization tests

* fix

* change by cuda:0 for now

* fix regression

* update device_map

* fix doc

* add noteboon

* update torch_dtype

* update doc

* typo

* typo

* remove comm

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: David Corvoysier <david.corvoysier@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Younes Belkada <younesbelkada@gmail.com>
This commit is contained in:
Marc Sun
2024-03-15 11:51:29 -04:00
committed by GitHub
parent f02aea2737
commit 28de2f4de3
18 changed files with 885 additions and 7 deletions

View File

@@ -26,6 +26,10 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
</Tip>
## QuantoConfig
[[autodoc]] QuantoConfig
## AqlmConfig
[[autodoc]] AqlmConfig

View File

@@ -26,6 +26,59 @@ Interested in adding a new quantization method to Transformers? Read the [HfQuan
</Tip>
## Quanto
<Tip>
Try Quanto + transformers with this [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing)!
</Tip>
[🤗 Quanto](https://github.com/huggingface/quanto) library is a versatile pytorch quantization toolkit. The quantization method used is the linear quantization. Quanto provides several unique features such as:
- weights quantization (`float8`,`int8`,`int4`,`int2`)
- activation quantization (`float8`,`int8`)
- modality agnostic (e.g CV,LLM)
- device agnostic (e.g CUDA,MPS,CPU)
- compatibility with `torch.compile`
- easy to add custom kernel for specific device
- supports quantization aware training
<!-- Add link to the blogpost -->
Before you begin, make sure the following libraries are installed:
```bash
pip install quanto
pip install git+https://github.com/huggingface/accelerate.git
pip install git+https://github.com/huggingface/transformers.git
```
Now you can quantize a model by passing [`QuantoConfig`] object in the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it contains `torch.nn.Linear` layers.
The integration with transformers only supports weights quantization. For the more complex use case such as activation quantization, calibration and quantization aware training, you should use [quanto](https://github.com/huggingface/quanto) library instead.
```py
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
```
Note that serialization is not supported yet with transformers but it is coming soon! If you want to save the model, you can use quanto library instead.
Quanto library uses linear quantization algorithm for quantization. Even though this is a basic quantization technique, we get very good results! Have a look at the following becnhmark (llama-2-7b on perplexity metric). You can find more benchamarks [here](https://github.com/huggingface/quanto/tree/main/bench/generation)
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/NousResearch-Llama-2-7b-hf_Perplexity.png" alt="llama-2-7b-quanto-perplexity" />
</div>
</div>
The library is versatible enough to be compatible with most PTQ optimization algorithms. The plan in the future is to integrate the most popular algorithms in the most seamless possible way (AWQ, Smoothquant).
## AQLM