AQLM quantizer support (#28928)
* aqlm init * calibration and dtypes * docs * Readme update * is_aqlm_available * Simpler link in docs * Test TODO real reference * init _import_structure fix * AqlmConfig autodoc * integration aqlm * integrations in tests * docstring fix * legacy typing * Less typings * More kernels information * Performance -> Accuracy * correct tests * remoced multi-gpu test * Update docs/source/en/quantization.md Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/utils/quantization_config.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Brought back multi-gpu tests * Update src/transformers/integrations/aqlm.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * Update tests/quantization/aqlm_integration/test_aqlm.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> --------- Co-authored-by: Andrei Panferov <blacksamorez@yandex-team.ru> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
This commit is contained in:
@@ -26,6 +26,10 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
|
||||
|
||||
</Tip>
|
||||
|
||||
## AqlmConfig
|
||||
|
||||
[[autodoc]] AqlmConfig
|
||||
|
||||
## AwqConfig
|
||||
|
||||
[[autodoc]] AwqConfig
|
||||
|
||||
@@ -26,6 +26,34 @@ Interested in adding a new quantization method to Transformers? Read the [HfQuan
|
||||
|
||||
</Tip>
|
||||
|
||||
## AQLM
|
||||
|
||||
|
||||
|
||||
Try AQLM on [Google Colab](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing)!
|
||||
|
||||
Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and take advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
|
||||
|
||||
Inference support for AQLM is realised in the `aqlm` library. Make sure to install it to run the models (note aqlm works only with python>=3.10):
|
||||
```bash
|
||||
pip install aqlm[gpu,cpu]
|
||||
```
|
||||
|
||||
The library provides efficient kernels for both GPU and CPU inference.
|
||||
|
||||
The instructions on how to quantize models yourself, as well as all the relevant code can be found in the corresponding GitHub [repository](https://github.com/Vahe1994/AQLM).
|
||||
|
||||
### AQLM configurations
|
||||
|
||||
AQLM quantization setpus vary mainly on the number of codebooks used as well as codebook sizes in bits. The most popular setups, as well as inference kernels they support are:
|
||||
|
||||
| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|
||||
|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|
|
||||
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
|
||||
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
|
||||
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
|
||||
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |
|
||||
|
||||
## AWQ
|
||||
|
||||
<Tip>
|
||||
|
||||
Reference in New Issue
Block a user