HFQuantizer implementation for compressed-tensors library (#31704)
* Add compressed-tensors HFQuantizer implementation
* flag serializable as False
* run
* revive lines deleted by ruff
* fixes to load+save from sparseml, edit config to quantization_config, and load back
* address satrat comment
* compressed_tensors to compressed-tensors and revert back is_serializable
* rename quant_method from sparseml to compressed-tensors
* tests
* edit tests
* clean up tests
* make style
* cleanup
* cleanup
* add test skip for when compressed tensors is not installed
* remove pydantic import + style
* delay torch import in test
* initial docs
* update main init for compressed tensors config
* make fix-copies
* docstring
* remove fill_docstring
* Apply suggestions from code review
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* review comments
* review comments
* comments - suppress warnings on state dict load, tests, fixes
* bug-fix - remove unnecessary call to apply quant lifecycle
* run_compressed compatability
* revert changes not needed for compression
* no longer need unexpected keys fn
* unexpected keys not needed either
* Apply suggestions from code review
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* add to_diff_dict
* update docs and expand testing
* Update _toctree.yml with compressed-tensors
* Update src/transformers/utils/quantization_config.py
  Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* update doc
* add note about saving a loaded model

---------

Co-authored-by: George Ohashi <george@neuralmagic.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Sara Adkins <sara@neuralmagic.com>
Co-authored-by: Sara Adkins <sara.adkins65@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Dipika <dipikasikka1@gmail.com>
@@ -177,6 +177,8 @@
  title: Optimum
- local: quantization/torchao
  title: TorchAO
- local: quantization/compressed_tensors
  title: compressed-tensors
- local: quantization/contribute
  title: Contribute new quantization method
title: Quantization Methods
@@ -61,7 +61,10 @@ Learn how to quantize models in the [Quantization](../quantization) guide.

[[autodoc]] FbgemmFp8Config

## CompressedTensorsConfig

[[autodoc]] CompressedTensorsConfig

## TorchAoConfig

[[autodoc]] TorchAoConfig

230
docs/source/en/quantization/compressed_tensors.md
Normal file
@@ -0,0 +1,230 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Compressed Tensors

The [`compressed-tensors`](https://github.com/neuralmagic/compressed-tensors) library provides a versatile and efficient way to store and manage compressed model checkpoints. It supports a range of quantization and sparsity schemes, which makes it a unified format for model optimizations such as GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.

Some of the supported formats include:
1. `dense`
2. `int-quantized`: INT8 quantized models
    - sample [model/config](https://huggingface.co/nm-testing/tinyllama-w8a8-compressed-hf-quantizer)
3. `float-quantized`: FP8 quantized models; currently supports E4M3
    - sample [model/config](https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-fp8-hf_compat/tree/main)
4. `pack-quantized`: INT4 or INT8 weight-quantized models, packed into INT32. For INT4, the weights have an INT4 range but are stored as INT8 and then packed into INT32.
    - sample [model/config](https://huggingface.co/nm-testing/tinyllama-w4a16-compressed-hf-quantizer)
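
To make the `pack-quantized` description concrete, here is a minimal sketch of packing eight 4-bit values into one 32-bit word and unpacking them again. The helper names `pack_int4` and `unpack_int4` are hypothetical, and the nibble order (lowest nibble first, two's complement) is an assumption for illustration; the layout actually used by compressed-tensors may differ.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) into 32-bit words, eight per word."""
    if len(values) % 8 != 0:
        raise ValueError("expected a multiple of 8 values")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for offset, value in enumerate(values[start:start + 8]):
            if not -8 <= value <= 7:
                raise ValueError(f"{value} is outside the INT4 range")
            # Keep the low 4 bits (two's complement) and shift them into place.
            # Assumption: the first value occupies the lowest nibble.
            word |= (value & 0xF) << (4 * offset)
        words.append(word)
    return words


def unpack_int4(words):
    """Recover the signed 4-bit integers from words produced by pack_int4."""
    values = []
    for word in words:
        for offset in range(8):
            nibble = (word >> (4 * offset)) & 0xF
            # Sign-extend the 4-bit two's-complement nibble.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


weights = [-8, -1, 0, 1, 2, 3, 7, -4]
packed = pack_int4(weights)
```

Eight INT4 weights fit into a single word, so a packed tensor needs one eighth as many 32-bit elements as there are weights, and `unpack_int4(packed)` returns the original list.
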

Compressed models can be easily created using [llm-compressor](https://github.com/vllm-project/llm-compressor).
Alternatively, models can be created independently and serialized with a compressed-tensors config.

To find existing models on the Hugging Face Model Hub, search for the [`compressed-tensors` tag](https://huggingface.co/models?other=compressed-tensors).

#### Features:
- Weight and activation precisions: FP8, INT4, INT8 (for Q/DQ, arbitrary precision is allowed for INT)
- Quantization scale and zero-point strategies: [tensor, channel, group, block, token](https://github.com/neuralmagic/compressed-tensors/blob/83b2e7a969d70606421a76b9a3d112646077c8de/src/compressed_tensors/quantization/quant_args.py#L43-L52)
- Dynamic per-token activation quantization (or any static strategy)
- Sparsity can be
- Supports quantization of arbitrary modules, not just Linear modules
- Targeted support or ignoring of modules by name or class
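
The last feature, targeting or ignoring modules by name or class, can be sketched as follows. This is not the library's matching code: `matches` and `select_modules` are hypothetical helpers, and the only behavior taken from this change is that entries may be layer names or types, with regular expressions marked by an `re:` prefix (as the `CompressedTensorsConfig` docstring states for `ignore`).

```python
import re


def matches(name, class_name, patterns):
    """Return True if a module matches any pattern by name, class, or 're:' regex."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumption: regex patterns are matched against the module name.
            if re.match(pattern[3:], name):
                return True
        elif pattern in (name, class_name):
            return True
    return False


def select_modules(modules, targets, ignore):
    """modules maps module name -> class name; returns the names to quantize."""
    return [
        name
        for name, class_name in modules.items()
        if matches(name, class_name, targets) and not matches(name, class_name, ignore)
    ]


modules = {
    "model.layers.0.self_attn.q_proj": "Linear",
    "model.layers.0.mlp.down_proj": "Linear",
    "model.layers.0.input_layernorm": "LlamaRMSNorm",
    "lm_head": "Linear",
}
selected = select_modules(modules, targets=["Linear"], ignore=["lm_head"])
```

With `targets=["Linear"]` and `ignore=["lm_head"]`, both projection layers are selected while the norm layer and `lm_head` are left alone.
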

## Installation

It is recommended to install stable releases of compressed-tensors from [PyPI](https://pypi.org/project/compressed-tensors):
```bash
pip install compressed-tensors
```

Developers who want to experiment with the latest features can also install the package from source:
```bash
git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .
```

## Quickstart Model Load
Quantized models can be easily loaded for inference as shown below. Only models that have already been quantized can be loaded at the moment. To quantize a model into the compressed-tensors format, see [llm-compressor](https://github.com/vllm-project/llm-compressor).

```python
from transformers import AutoModelForCausalLM

# Load the model in compressed-tensors format
ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")

# Measure memory usage
mem_params = sum([param.nelement()*param.element_size() for param in ct_model.parameters()])
print(f"{mem_params/2**30:.4f} GB")
# 8.4575 GB
```

As shown above, the compressed-tensors FP8 checkpoint of Llama 3.1 8B can be loaded for inference using roughly half the memory of the unquantized reference checkpoint.
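
The memory figure can be cross-checked with a back-of-the-envelope calculation. The sketch below assumes the layer shapes shown in the model printout later in this guide, one byte per FP8 weight, two bytes for everything left unquantized (embeddings, `lm_head`, norm weights), and ignores the tiny scale tensors; these are assumptions for illustration, not values read from the checkpoint.

```python
GIB = 2**30

hidden, intermediate, kv_dim = 4096, 14336, 1024
vocab, num_layers = 128256, 32

# Linear weights in one decoder layer: q, k, v, o, gate, up, down
linear_per_layer = (
    hidden * hidden                # q_proj
    + kv_dim * hidden              # k_proj
    + kv_dim * hidden              # v_proj
    + hidden * hidden              # o_proj
    + 3 * intermediate * hidden    # gate_proj, up_proj, down_proj
)
quantized_params = num_layers * linear_per_layer

# Kept in 16-bit: input embeddings, lm_head, and the RMSNorm weights
norm_params = num_layers * 2 * hidden + hidden
unquantized_params = 2 * vocab * hidden + norm_params

fp8_bytes = quantized_params * 1 + unquantized_params * 2
bf16_bytes = (quantized_params + unquantized_params) * 2

print(f"FP8 checkpoint:  {fp8_bytes / GIB:.4f} GB")
print(f"16-bit baseline: {bf16_bytes / GIB:.4f} GB")
print(f"ratio: {fp8_bytes / bf16_bytes:.2f}")
```

Under these assumptions the estimate is 8.4575 GB for the FP8 checkpoint against 14.9575 GB for a 16-bit one, a ratio of about 0.57. The saving is a little less than half because the embeddings and `lm_head` stay in 16-bit.
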

## Sample Use Cases - Load and run an FP8 model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is"
]

model_name = "nm-testing/Meta-Llama-3-8B-Instruct-fp8-hf_compat"

quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Move the tokenized prompts to the device the model was placed on
inputs = tokenizer(prompt, return_tensors="pt").to(quantized_model.device)
generated_ids = quantized_model.generate(**inputs, max_length=50, do_sample=False)
outputs = tokenizer.batch_decode(generated_ids)

print(outputs)

"""
['<|begin_of_text|>Hello, my name is [Name]. I am a [Your Profession/Student] and I am here to learn about the [Course/Program] at [University/Institution]. I am excited to be here and I am looking forward to', '<|begin_of_text|>The capital of France is Paris, which is located in the north-central part of the country. Paris is the most populous city in France and is known for its stunning architecture, art museums, fashion, and romantic atmosphere. The city is home to', "<|begin_of_text|>The future of AI is here, and it's already changing the way we live and work. From virtual assistants to self-driving cars, AI is transforming industries and revolutionizing the way we interact with technology. But what does the future of AI hold"]
"""

```

The above shows a quick example for running generation using a `compressed-tensors`
model. Currently, once loaded, the model cannot be saved.

## Deep dive into a compressed-tensors model checkpoint

In this example we will examine how the compressed-tensors model `nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf` is defined through its configuration entry and see how this translates to the loaded model representation.

First, let us look at the [`quantization_config` of the model](https://huggingface.co/nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf/blob/main/config.json). At first glance the number of entries looks overwhelming, but this is because compressed-tensors is a format that allows for flexible expression both during and after model compression.

In practice, for checkpoint loading and inference the configuration can be simplified to omit all the default or empty entries, so we will do that here to focus on what compression is actually represented.

```yaml
"quantization_config": {
  "config_groups": {
    "group_0": {
      "input_activations": {
        "num_bits": 8,
        "strategy": "tensor",
        "type": "float"
      },
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8,
        "strategy": "tensor",
        "type": "float"
      }
    }
  },
  "format": "naive-quantized",
  "ignore": ["lm_head"],
  "quant_method": "compressed-tensors",
  "quantization_status": "frozen"
},
```

The configuration above specifies one config group that quantizes both weights and activations to FP8 with a static per-tensor strategy. The `ignore` list also contains an entry that skips quantization of the `lm_head` module, so that module should be untouched in the checkpoint.
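
Reading such a configuration programmatically can be sketched as below. The helpers `describe` and `summarize` are hypothetical and only use the keys visible in the simplified configuration above; they are not part of transformers or compressed-tensors.

```python
import json

# The simplified quantization_config from above, as a JSON object
quantization_config = json.loads("""
{
  "config_groups": {
    "group_0": {
      "input_activations": {"num_bits": 8, "strategy": "tensor", "type": "float"},
      "targets": ["Linear"],
      "weights": {"num_bits": 8, "strategy": "tensor", "type": "float"}
    }
  },
  "format": "naive-quantized",
  "ignore": ["lm_head"],
  "quant_method": "compressed-tensors",
  "quantization_status": "frozen"
}
""")


def describe(args):
    """Format one quantization-args entry, e.g. 'float8 per-tensor'."""
    if args is None:
        return "not quantized"
    return f"{args['type']}{args['num_bits']} per-{args['strategy']}"


def summarize(config):
    """Return one summary line per config group, plus the ignore list."""
    lines = []
    for name, group in config["config_groups"].items():
        lines.append(
            f"{name}: targets={group['targets']}, "
            f"weights={describe(group.get('weights'))}, "
            f"input_activations={describe(group.get('input_activations'))}"
        )
    lines.append(f"ignored: {config.get('ignore', [])}")
    return lines


summary = summarize(quantization_config)
print("\n".join(summary))
```

For this checkpoint the summary has one line for `group_0` (FP8 per-tensor weights and input activations on `Linear` modules) and one line listing `lm_head` as ignored.
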

To see the result of the configuration in practice, we can simply use the [safetensors viewer](https://huggingface.co/nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf?show_file_info=model.safetensors.index.json) on the model card to see the quantized weights, `input_scale`, and `weight_scale` for all of the Linear modules in the first model layer (and so on for the rest of the layers).

| Tensors | Shape | Precision |
| ------- | ----- | --------- |
| model.layers.0.input_layernorm.weight | [4096] | BF16 |
| model.layers.0.mlp.down_proj.input_scale | [1] | BF16 |
| model.layers.0.mlp.down_proj.weight | [4096, 14336] | F8_E4M3 |
| model.layers.0.mlp.down_proj.weight_scale | [1] | BF16 |
| model.layers.0.mlp.gate_proj.input_scale | [1] | BF16 |
| model.layers.0.mlp.gate_proj.weight | [14336, 4096] | F8_E4M3 |
| model.layers.0.mlp.gate_proj.weight_scale | [1] | BF16 |
| model.layers.0.mlp.up_proj.input_scale | [1] | BF16 |
| model.layers.0.mlp.up_proj.weight | [14336, 4096] | F8_E4M3 |
| model.layers.0.mlp.up_proj.weight_scale | [1] | BF16 |
| model.layers.0.post_attention_layernorm.weight | [4096] | BF16 |
| model.layers.0.self_attn.k_proj.input_scale | [1] | BF16 |
| model.layers.0.self_attn.k_proj.weight | [1024, 4096] | F8_E4M3 |
| model.layers.0.self_attn.k_proj.weight_scale | [1] | BF16 |
| model.layers.0.self_attn.o_proj.input_scale | [1] | BF16 |
| model.layers.0.self_attn.o_proj.weight | [4096, 4096] | F8_E4M3 |
| model.layers.0.self_attn.o_proj.weight_scale | [1] | BF16 |
| model.layers.0.self_attn.q_proj.input_scale | [1] | BF16 |
| model.layers.0.self_attn.q_proj.weight | [4096, 4096] | F8_E4M3 |
| model.layers.0.self_attn.q_proj.weight_scale | [1] | BF16 |
| model.layers.0.self_attn.v_proj.input_scale | [1] | BF16 |
| model.layers.0.self_attn.v_proj.weight | [1024, 4096] | F8_E4M3 |
| model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16 |
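
The `[1]`-shaped scale tensors in the table follow from the per-tensor strategy: one scale is shared by every element of a weight (or activation) tensor. The sketch below illustrates the idea with symmetric integer rounding, which is a deliberate simplification; the checkpoint itself stores FP8 E4M3 values, and the function names here are hypothetical.

```python
def quantize_per_tensor(values, num_bits=8):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / qmax if max_abs else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map stored integers back to approximate real values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 0.9]
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), and only the integers plus the single scale need to be stored.
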

When we load the model with the compressed-tensors HFQuantizer integration, we can see that all of the Linear modules that are specified within the quantization configuration have been replaced by `CompressedLinear` modules that manage the compressed weights and forward pass for inference. Note that the `lm_head` mentioned before in the ignore list is still kept as an unquantized Linear module.

```python
from transformers import AutoModelForCausalLM

ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")
print(ct_model)
"""
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): CompressedLinear(
            in_features=4096, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (k_proj): CompressedLinear(
            in_features=4096, out_features=1024, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (v_proj): CompressedLinear(
            in_features=4096, out_features=1024, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (o_proj): CompressedLinear(
            in_features=4096, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): CompressedLinear(
            in_features=4096, out_features=14336, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (up_proj): CompressedLinear(
            in_features=4096, out_features=14336, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (down_proj): CompressedLinear(
            in_features=14336, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
"""
```
@@ -50,6 +50,7 @@ Use the table below to help you decide which quantization method to use.
| [AQLM](./aqlm) | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 1 / 2 | 🟢 | 🟢 | 🟢 | https://github.com/Vahe1994/AQLM |
| [AWQ](./awq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | ? | 4 | 🟢 | 🟢 | 🟢 | https://github.com/casper-hansen/AutoAWQ |
| [bitsandbytes](./bitsandbytes) | 🟢 | 🟡 * | 🟢 | 🟡 * | 🔴 ** | 🔴 (soon!) | 4 / 8 | 🟢 | 🟢 | 🟢 | https://github.com/bitsandbytes-foundation/bitsandbytes |
| [compressed-tensors](./compressed_tensors) | 🔴 | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 1 - 8 | 🟢 | 🟢 | 🟢 | https://github.com/neuralmagic/compressed-tensors |
| [EETQ](./eetq) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | ? | 8 | 🟢 | 🟢 | 🟢 | https://github.com/NetEase-FuXi/EETQ |
| GGUF / GGML (llama.cpp) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 1 - 8 | 🔴 | [See GGUF section](../gguf) | [See GGUF section](../gguf) | https://github.com/ggerganov/llama.cpp |
| [GPTQ](./gptq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 2 - 3 - 4 - 8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ |
@@ -958,6 +958,7 @@ _import_structure = {
        "AqlmConfig",
        "AwqConfig",
        "BitsAndBytesConfig",
        "CompressedTensorsConfig",
        "EetqConfig",
        "FbgemmFp8Config",
        "GPTQConfig",
@@ -5802,6 +5803,7 @@ if TYPE_CHECKING:
        AqlmConfig,
        AwqConfig,
        BitsAndBytesConfig,
        CompressedTensorsConfig,
        EetqConfig,
        FbgemmFp8Config,
        GPTQConfig,
@@ -19,6 +19,7 @@ from ..utils.quantization_config import (
    AqlmConfig,
    AwqConfig,
    BitsAndBytesConfig,
    CompressedTensorsConfig,
    EetqConfig,
    FbgemmFp8Config,
    GPTQConfig,
@@ -32,6 +33,7 @@ from .quantizer_aqlm import AqlmHfQuantizer
from .quantizer_awq import AwqQuantizer
from .quantizer_bnb_4bit import Bnb4BitHfQuantizer
from .quantizer_bnb_8bit import Bnb8BitHfQuantizer
from .quantizer_compressed_tensors import CompressedTensorsHfQuantizer
from .quantizer_eetq import EetqHfQuantizer
from .quantizer_fbgemm_fp8 import FbgemmFp8HfQuantizer
from .quantizer_gptq import GptqHfQuantizer
@@ -49,6 +51,7 @@ AUTO_QUANTIZER_MAPPING = {
    "quanto": QuantoHfQuantizer,
    "eetq": EetqHfQuantizer,
    "hqq": HqqHfQuantizer,
    "compressed-tensors": CompressedTensorsHfQuantizer,
    "fbgemm_fp8": FbgemmFp8HfQuantizer,
    "torchao": TorchAoHfQuantizer,
}
@@ -62,6 +65,7 @@ AUTO_QUANTIZATION_CONFIG_MAPPING = {
    "aqlm": AqlmConfig,
    "quanto": QuantoConfig,
    "hqq": HqqConfig,
    "compressed-tensors": CompressedTensorsConfig,
    "fbgemm_fp8": FbgemmFp8Config,
    "torchao": TorchAoConfig,
}
77
src/transformers/quantizers/quantizer_compressed_tensors.py
Normal file
@@ -0,0 +1,77 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from ..utils import is_compressed_tensors_available, is_torch_available, logging
from ..utils.quantization_config import QuantizationConfigMixin
from .base import HfQuantizer


if is_torch_available():
    import torch

logger = logging.get_logger(__name__)


class CompressedTensorsHfQuantizer(HfQuantizer):
    """
    Quantizer for the compressed_tensors package.  Loads and restores models to
    quantized state with compressed_tensors
    """

    requires_calibration = True
    required_packages = ["compressed_tensors"]

    def __init__(self, quantization_config: QuantizationConfigMixin, **kwargs):
        super().__init__(quantization_config, **kwargs)

        from compressed_tensors.compressors import ModelCompressor

        self.compressor = ModelCompressor.from_compression_config(quantization_config)

    def validate_environment(self, *args, **kwargs):
        if not is_compressed_tensors_available():
            raise ImportError(
                "Using `compressed_tensors` quantized models requires the compressed-tensors library: "
                "`pip install compressed-tensors`"
            )
        if not is_torch_available():
            # torch already should be installed as part of compressed tensors
            raise ImportError("torch is required for using compressed-tensors quantization")

    def update_torch_dtype(self, torch_dtype: "torch.dtype") -> "torch.dtype":
        if torch_dtype is None:
            logger.info("Loading model using torch.float16 for compressed-tensors quantization")
            torch_dtype = torch.float16
        elif torch_dtype != torch.float16:
            logger.info(
                "We suggest setting `torch_dtype=torch.float16` for better efficiency with compressed_tensors."
            )
        return torch_dtype

    def _process_model_before_weight_loading(self, model, **kwargs):
        from compressed_tensors.quantization import apply_quantization_config

        ct_quantization_config = self.compressor.quantization_config
        apply_quantization_config(model, ct_quantization_config, run_compressed=True)

    def _process_model_after_weight_loading(self, model, **kwargs):
        pass

    @property
    def is_trainable(self):
        return False

    @property
    def is_serializable(self):
        return False
@@ -63,6 +63,7 @@ from .utils import (
    is_bitsandbytes_available,
    is_bitsandbytes_multi_backend_available,
    is_bs4_available,
    is_compressed_tensors_available,
    is_cv2_available,
    is_cython_available,
    is_decord_available,
@@ -1199,6 +1200,13 @@ def require_quanto(test_case):
    return unittest.skipUnless(is_quanto_available(), "test requires quanto")(test_case)


def require_compressed_tensors(test_case):
    """
    Decorator for compressed_tensors dependency
    """
    return unittest.skipUnless(is_compressed_tensors_available(), "test requires compressed_tensors")(test_case)


def require_fbgemm_gpu(test_case):
    """
    Decorator for fbgemm_gpu dependency
@@ -124,6 +124,7 @@ from .import_utils import (
    is_bitsandbytes_multi_backend_available,
    is_bs4_available,
    is_coloredlogs_available,
    is_compressed_tensors_available,
    is_cv2_available,
    is_cython_available,
    is_datasets_available,
@@ -142,6 +142,7 @@ _auto_gptq_available = _is_package_available("auto_gptq")
# `importlib.metadata.version` doesn't work with `awq`
_auto_awq_available = importlib.util.find_spec("awq") is not None
_quanto_available = _is_package_available("quanto")
_compressed_tensors_available = _is_package_available("compressed_tensors")
_pandas_available = _is_package_available("pandas")
_peft_available = _is_package_available("peft")
_phonemizer_available = _is_package_available("phonemizer")
@@ -963,6 +964,10 @@ def is_quanto_available():
    return _quanto_available


def is_compressed_tensors_available():
    return _compressed_tensors_available


def is_auto_gptq_available():
    return _auto_gptq_available

@@ -42,6 +42,7 @@ class QuantizationMethod(str, Enum):
    QUANTO = "quanto"
    EETQ = "eetq"
    HQQ = "hqq"
    COMPRESSED_TENSORS = "compressed-tensors"
    FBGEMM_FP8 = "fbgemm_fp8"
    TORCHAO = "torchao"

@@ -1051,6 +1052,130 @@ class EetqConfig(QuantizationConfigMixin):
            raise ValueError(f"Only support weights in {accepted_weights} but found {self.weights}")


class CompressedTensorsConfig(QuantizationConfigMixin):
    """
    This is a wrapper class that handles compressed-tensors quantization config options.
    It is a wrapper around `compressed_tensors.QuantizationConfig`
    Args:
        config_groups (`typing.Dict[str, typing.Union[ForwardRef('QuantizationScheme'), typing.List[str]]]`, *optional*):
            dictionary mapping group name to a quantization scheme definition
        format (`str`, *optional*, defaults to `"dense"`):
            format the model is represented as
        quantization_status (`QuantizationStatus`, *optional*, defaults to `"initialized"`):
            status of model in the quantization lifecycle, i.e. 'initialized', 'calibration', 'frozen'
        kv_cache_scheme (`typing.Union[QuantizationArgs, NoneType]`, *optional*):
            specifies quantization of the kv cache. If None, kv cache is not quantized.
        global_compression_ratio (`typing.Union[float, NoneType]`, *optional*):
            0-1 float percentage of model compression
        ignore (`typing.Union[typing.List[str], NoneType]`, *optional*):
            layer names or types to not quantize, supports regex prefixed by 're:'
        sparsity_config (`typing.Dict[str, typing.Any]`, *optional*):
            configuration for sparsity compression
        quant_method (`str`, *optional*, defaults to `"compressed-tensors"`):
            do not override, should be compressed-tensors
    """

    def __init__(
        self,
        config_groups: Dict[str, Union["QuantizationScheme", List[str]]] = None,  # noqa: F821
        format: str = "dense",
        quantization_status: "QuantizationStatus" = "initialized",  # noqa: F821
        kv_cache_scheme: Optional["QuantizationArgs"] = None,  # noqa: F821
        global_compression_ratio: Optional[float] = None,
        ignore: Optional[List[str]] = None,
        sparsity_config: Dict[str, Any] = None,
        quant_method: str = "compressed-tensors",
        **kwargs,
    ):
        from compressed_tensors import QuantizationConfig
        from compressed_tensors.config import SparsityCompressionConfig

        self.quantization_config = None
        self.sparsity_config = None

        # parse from dict to load nested QuantizationScheme objects
        if config_groups:
            self.quantization_config = QuantizationConfig.parse_obj(
                {
                    "config_groups": config_groups,
                    "quant_method": quant_method,
                    "format": format,
                    "quantization_status": quantization_status,
                    "kv_cache_scheme": kv_cache_scheme,
                    "global_compression_ratio": global_compression_ratio,
                    "ignore": ignore,
                    **kwargs,
                }
            )

        if sparsity_config:
            self.sparsity_config = SparsityCompressionConfig.load_from_registry(
                sparsity_config.get("format"), **sparsity_config
            )

        super().__init__(quant_method=QuantizationMethod.COMPRESSED_TENSORS)

    @classmethod
    def from_dict(cls, config_dict, return_unused_kwargs=False, **kwargs):
        """
        Instantiates a [`CompressedTensorsConfig`] from a Python dictionary of parameters.
        Optionally unwraps any args from the nested quantization_config

        Args:
            config_dict (`Dict[str, Any]`):
                Dictionary that will be used to instantiate the configuration object.
            return_unused_kwargs (`bool`, *optional*, defaults to `False`):
                Whether or not to return a list of unused keyword arguments. Used for `from_pretrained` method in
                `PreTrainedModel`.
            kwargs (`Dict[str, Any]`):
                Additional parameters from which to initialize the configuration object.

        Returns:
            [`QuantizationConfigMixin`]: The configuration object instantiated from those parameters.
        """
        if "quantization_config" in config_dict:
            config_dict = dict(
                sparsity_config=config_dict.get("sparsity_config"),
                **config_dict["quantization_config"],
            )

        return super().from_dict(config_dict, return_unused_kwargs=return_unused_kwargs, **kwargs)

    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes this instance to a Python dictionary.
        Returns:
            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        quantization_config = self.quantization_config.dict() if self.quantization_config is not None else None
        sparsity_config = self.sparsity_config.dict() if self.sparsity_config is not None else None

        return {
            "quantization_config": quantization_config,
            "sparsity_config": sparsity_config,
        }

    def to_diff_dict(self) -> Dict[str, Any]:
        """
        Removes all attributes from config which correspond to the default config attributes for better readability and
        serializes to a Python dictionary.
        Returns:
            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        config_dict = self.to_dict()

        # get the default config dict
        default_config_dict = CompressedTensorsConfig().to_dict()

        serializable_config_dict = {}

        # only serialize values that differ from the default config
        for key, value in config_dict.items():
            if value != default_config_dict[key]:
                serializable_config_dict[key] = value

        return serializable_config_dict


@dataclass
class FbgemmFp8Config(QuantizationConfigMixin):
    """

0
tests/quantization/compressed_tensor/__init__.py
Normal file
@@ -0,0 +1,87 @@
import gc
import unittest

from transformers import AutoModelForCausalLM, AutoTokenizer, CompressedTensorsConfig
from transformers.testing_utils import require_compressed_tensors, require_torch
from transformers.utils import is_torch_available


if is_torch_available():
    import torch


@require_compressed_tensors
@require_torch
class CompressedTensorsTest(unittest.TestCase):
    tinyllama_w8a16 = "nm-testing/tinyllama-w8a16-dense-hf-quantizer"
    tinyllama_w4a16 = "nm-testing/tinyllama-w4a16-compressed-hf-quantizer"
    tinyllama_w8a8 = "nm-testing/tinyllama-w8a8-compressed-hf-quantizer"
    llama3_8b_fp8 = "nm-testing/Meta-Llama-3-8B-Instruct-fp8-hf_compat"

    prompt = "Paris is the capital of which country?"

    def tearDown(self):
        gc.collect()
        torch.cuda.empty_cache()
        gc.collect()

    def test_config_args(self):
        with self.assertRaises(ValueError):
            # passing quant scheme directly is not allowed
            CompressedTensorsConfig(config_groups={"weights": {"num_bits": 8}})
        CompressedTensorsConfig(
            config_groups={"FP8": ["Linear"]},
            ignore=["lm_head"],
            quantization_status="frozen",
            sparsity_config={"format": "dense"},
        )

    def test_config_to_from_dict(self):
        config = CompressedTensorsConfig(config_groups={"FP8": ["Linear"]}, sparsity_config={"format": "dense"})
        config_dict = config.to_dict()
        config_from_dict = CompressedTensorsConfig.from_dict(config_dict)

        from compressed_tensors import QuantizationConfig, SparsityCompressionConfig

        self.assertIsInstance(config_from_dict.quantization_config, QuantizationConfig)
        self.assertIsInstance(config_from_dict.sparsity_config, SparsityCompressionConfig)

    def test_tinyllama_w8a8(self):
        expected_out = "<s> Paris is the capital of which country?\n\n**A) Paris**\n\n**Q** ** Paris is the capital of which country?\n\n**A) Paris**\n\n**Q** ** Paris is the capital of which country"
        self._test_quantized_model(self.tinyllama_w8a8, expected_out)

    def test_tinyllama_w4a16(self):
        expected_out = "<s> Paris is the capital of which country?\nAnswer: Paris is the capital of France.\nQuestion: Which country is the capital of which city?\nAnswer: The capital of the city of New York is New York.\nQuestion: Which"
        self._test_quantized_model(self.tinyllama_w4a16, expected_out)

    def test_tinyllama_w8a16(self):
        expected_out = "<s> Paris is the capital of which country?\nA. France\nB. Germany\nC. Spain\nD. Italy\nE. Switzerland\nQ10. Which of the following is not a country in the European Union?\nA."
        self._test_quantized_model(self.tinyllama_w8a16, expected_out)

    def test_llama_8b_fp8(self):
        expected_out = "<|begin_of_text|>Paris is the capital of which country? France\nWhat is the name of the famous art museum in Paris? The Louvre\nWhat is the name of the famous opera house in Paris? Palais Garnier\nWhat is the name of the"
        self._test_quantized_model(self.llama3_8b_fp8, expected_out)

    def _test_quantized_model(self, model_name: str, expected_output: str):
        """Carry out generation"""
        quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        device = quantized_model.device
        self.assertIsNotNone(
            quantized_model.config.quantization_config,
            "quantization_config should not be None",
        )
        self.assertTrue(
            any(
                key
                for key, tensor in quantized_model.state_dict().items()
                if "scale" in key and not torch.all(tensor == 1.0)
            ),
            "quantized model should load a non-trivial scale into the state dict",
        )
        inputs = tokenizer(self.prompt, return_tensors="pt").to(device)
        generated_ids = quantized_model.generate(**inputs, max_length=50, do_sample=False)
        outputs = tokenizer.batch_decode(generated_ids)

        self.assertIsNotNone(outputs)
        self.assertEqual(outputs[0], expected_output)