Add option for ao base configs (#36526)
This commit is contained in:
@@ -20,18 +20,95 @@ Install torchao with the following command.
|
||||
pip install --upgrade torch torchao transformers
|
||||
```
|
||||
|
||||
torchao supports many quantization types for different data types (int4, float8, weight only, etc.), but the Transformers integration only currently supports int8 weight quantization and int8 dynamic quantization of weights.
|
||||
torchao supports many quantization types for different data types (int4, float8, weight only, etc.).
|
||||
Starting with version 0.10.0, torchao provides enhanced flexibility through the `AOBaseConfig` API, allowing for more customized quantization configurations.
|
||||
And full access to the techniques offered in the torchao library.
|
||||
|
||||
You can manually choose the quantization types and settings or automatically select the quantization types.
|
||||
|
||||
<hfoptions id="torchao">
|
||||
<hfoption id="manual">
|
||||
|
||||
|
||||
Create a [`TorchAoConfig`] and specify the quantization type and `group_size` of the weights to quantize. Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method.
|
||||
|
||||
> [!TIP]
|
||||
> Run the quantized model on a CPU by changing `device_map` to `"cpu"` and `layout` to `Int4CPULayout()`. This is only available in torchao 0.8.0+.
|
||||
|
||||
In torchao 0.10.0+, you can use the more flexible `AOBaseConfig` approach instead of string identifiers:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int4WeightOnlyConfig
|
||||
|
||||
# Using AOBaseConfig instance (torchao >= 0.10.0)
|
||||
quant_config = Int4WeightOnlyConfig(group_size=128)
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Meta-Llama-3-8B",
|
||||
torch_dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Available Quantization Schemes
|
||||
|
||||
TorchAO provides a variety of quantization configurations:
|
||||
|
||||
- `Int4WeightOnlyConfig`
|
||||
- `Int8WeightOnlyConfig`
|
||||
- `Int8DynamicActivationInt8WeightConfig`
|
||||
- `Float8WeightOnlyConfig`
|
||||
|
||||
Each configuration can be further customized with parameters such as `group_size`, `scheme`, and `layout` to optimize for specific hardware and model architectures.
|
||||
|
||||
For a complete list of available configurations, see our [quantization API documentation](https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_api.py).
|
||||
|
||||
> **⚠️ DEPRECATION WARNING**
|
||||
>
|
||||
> Starting with version 0.10.0, the string-based API for quantization configuration (e.g., `TorchAoConfig("int4_weight_only", group_size=128)`) is **deprecated** and will be removed in a future release.
|
||||
>
|
||||
> Please use the new `AOBaseConfig`-based approach instead:
|
||||
>
|
||||
> ```python
|
||||
> # Old way (deprecated)
|
||||
> quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
|
||||
>
|
||||
> # New way (recommended)
|
||||
> from torchao.quantization import Int4WeightOnlyConfig
|
||||
> quant_config = Int4WeightOnlyConfig(group_size=128)
|
||||
> quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
> ```
|
||||
>
|
||||
> The new API offers greater flexibility, better type safety, and access to the full range of features available in torchao.
|
||||
>
|
||||
> ## Migration Guide
|
||||
>
|
||||
> Here's how to migrate from common string identifiers to their `AOBaseConfig` equivalents:
|
||||
>
|
||||
> | Old String API | New `AOBaseConfig` API |
|
||||
> |----------------|------------------------|
|
||||
> | `"int4_weight_only"` | `Int4WeightOnlyConfig()` |
|
||||
> | `"int8_weight_only"` | `Int8WeightOnlyConfig()` |
|
||||
> | `"int8_dynamic_activation_int8_weight"` | `Int8DynamicActivationInt8WeightConfig()` |
|
||||
>
|
||||
> All configuration objects accept parameters for customization (e.g., `group_size`, `scheme`, `layout`).
|
||||
|
||||
|
||||
Below is the API for for torchao < `0.9.0`
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
@@ -78,7 +155,7 @@ print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_toke
|
||||
|
||||
The [autoquant](https://pytorch.org/ao/stable/generated/torchao.quantization.autoquant.html#torchao.quantization.autoquant) API automatically chooses a quantization type for quantizable layers (`nn.Linear`) by micro-benchmarking on input type and shape and compiling a single linear layer.
|
||||
|
||||
Create a [`TorchAoConfig`] and set to `"autoquant"`. Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method. Finally, call `finalize_autoquant` on the quantized model to finalize the quantization and log the input shapes.
|
||||
Create a [`TorchAoConfig`] and set to `"autoquant"`. Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method. Finally, call `finalize_autoquant` on the quantized model to finalize the quantization and log the input shapes.
|
||||
|
||||
> [!TIP]
|
||||
> Run the quantized model on a CPU by changing `device_map` to `"cpu"` and `layout` to `Int4CPULayout()`. This is only available in torchao 0.8.0+.
|
||||
@@ -131,7 +208,7 @@ print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_toke
|
||||
|
||||
## Serialization
|
||||
|
||||
torchao implements [torch.Tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) for maximum flexibility in supporting new quantized torch.Tensor formats. [Safetensors](https://huggingface.co/docs/safetensors/en/index) serialization and deserialization does not work with torchaco.
|
||||
torchao implements [torch.Tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) for maximum flexibility in supporting new quantized torch.Tensor formats. [Safetensors](https://huggingface.co/docs/safetensors/en/index) serialization and deserialization does not work with torchao.
|
||||
|
||||
To avoid arbitrary user code execution, torchao sets `weights_only=True` in [torch.load](https://pytorch.org/docs/stable/generated/torch.load.html) to ensure only tensors are loaded. Any known user functions can be whitelisted with [add_safe_globals](https://pytorch.org/docs/stable/notes/serialization.html#torch.serialization.add_safe_globals).
|
||||
|
||||
|
||||
Reference in New Issue
Block a user