[docs] Increase visibility of torch_dtype="auto" (#35067)
* auto-dtype * feedback
This commit is contained in:
@@ -64,7 +64,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
)
|
||||
```
|
||||
|
||||
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:
|
||||
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file.
|
||||
|
||||
```py
|
||||
import torch
|
||||
@@ -75,7 +75,7 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
|
||||
model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
"facebook/opt-350m",
|
||||
quantization_config=quantization_config,
|
||||
torch_dtype=torch.float32
|
||||
torch_dtype="auto"
|
||||
)
|
||||
model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
|
||||
```
|
||||
@@ -112,7 +112,7 @@ model_4bit = AutoModelForCausalLM.from_pretrained(
|
||||
)
|
||||
```
|
||||
|
||||
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:
|
||||
By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file.
|
||||
|
||||
```py
|
||||
import torch
|
||||
@@ -123,7 +123,7 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True)
|
||||
model_4bit = AutoModelForCausalLM.from_pretrained(
|
||||
"facebook/opt-350m",
|
||||
quantization_config=quantization_config,
|
||||
torch_dtype=torch.float32
|
||||
torch_dtype="auto"
|
||||
)
|
||||
model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
|
||||
```
|
||||
@@ -190,6 +190,7 @@ Now load your model with the custom `device_map` and `quantization_config`:
|
||||
```py
|
||||
model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
"bigscience/bloom-1b7",
|
||||
torch_dtype="auto",
|
||||
device_map=device_map,
|
||||
quantization_config=quantization_config,
|
||||
)
|
||||
@@ -212,6 +213,7 @@ quantization_config = BitsAndBytesConfig(
|
||||
|
||||
model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
torch_dtype="auto",
|
||||
device_map=device_map,
|
||||
quantization_config=quantization_config,
|
||||
)
|
||||
@@ -232,6 +234,7 @@ quantization_config = BitsAndBytesConfig(
|
||||
|
||||
model_8bit = AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
torch_dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config,
|
||||
)
|
||||
@@ -275,7 +278,7 @@ nf4_config = BitsAndBytesConfig(
|
||||
bnb_4bit_quant_type="nf4",
|
||||
)
|
||||
|
||||
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
|
||||
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", quantization_config=nf4_config)
|
||||
```
|
||||
|
||||
For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the `bnb_4bit_compute_dtype` and `torch_dtype` values.
|
||||
@@ -292,7 +295,7 @@ double_quant_config = BitsAndBytesConfig(
|
||||
bnb_4bit_use_double_quant=True,
|
||||
)
|
||||
|
||||
model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
|
||||
model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", torch_dtype="auto", quantization_config=double_quant_config)
|
||||
```
|
||||
|
||||
## Dequantizing `bitsandbytes` models
|
||||
|
||||
@@ -33,13 +33,14 @@ pip install --upgrade accelerate fbgemm-gpu torch
|
||||
|
||||
If you are having issues with fbgemm-gpu and torch library, you might need to install the nightly release. You can follow the instruction [here](https://pytorch.org/FBGEMM/fbgemm_gpu-development/InstallationInstructions.html#fbgemm-gpu-install-libraries:~:text=found%20here.-,Install%20the%20FBGEMM_GPU%20Package,-Install%20through%20PyTorch)
|
||||
|
||||
By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file to automatically load the most memory-optimal data type.
|
||||
|
||||
```py
|
||||
from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model_name = "meta-llama/Meta-Llama-3-8B"
|
||||
quantization_config = FbgemmFp8Config()
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
input_text = "What are we having for dinner?"
|
||||
|
||||
@@ -42,7 +42,9 @@ pip install optimum-quanto accelerate transformers
|
||||
|
||||
Now you can quantize a model by passing [`QuantoConfig`] object in the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it contains `torch.nn.Linear` layers.
|
||||
|
||||
The integration with transformers only supports weights quantization. For the more complex use case such as activation quantization, calibration and quantization aware training, you should use [optimum-quanto](https://github.com/huggingface/optimum-quanto) library instead.
|
||||
The integration with transformers only supports weights quantization. For the more complex use case such as activation quantization, calibration and quantization aware training, you should use [optimum-quanto](https://github.com/huggingface/optimum-quanto) library instead.
|
||||
|
||||
By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file to automatically load the most memory-optimal data type.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
|
||||
@@ -50,7 +52,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
|
||||
model_id = "facebook/opt-125m"
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
quantization_config = QuantoConfig(weights="int8")
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0", quantization_config=quantization_config)
|
||||
```
|
||||
|
||||
Note that serialization is not supported yet with transformers but it is coming soon! If you want to save the model, you can use quanto library instead.
|
||||
|
||||
@@ -19,6 +19,7 @@ Before you begin, make sure the following libraries are installed with their lat
|
||||
pip install --upgrade torch torchao
|
||||
```
|
||||
|
||||
By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file to automatically load the most memory-optimal data type.
|
||||
|
||||
```py
|
||||
import torch
|
||||
@@ -28,7 +29,7 @@ model_name = "meta-llama/Meta-Llama-3-8B"
|
||||
# We support int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight
|
||||
# More examples and documentations for arguments can be found in https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
|
||||
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
input_text = "What are we having for dinner?"
|
||||
|
||||
Reference in New Issue
Block a user