[Docs/quantization] Clearer explanation on how things works under the hood. + remove outdated info (#25216)
* clearer explanation on how things works under the hood.
* Update docs/source/en/main_classes/quantization.md
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/main_classes/quantization.md
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* add `load_in_4bit` in `from_pretrained`

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@@ -29,6 +29,29 @@ If you want to quantize your own pytorch model, check out this [documentation](h
Here are the things you can do using `bitsandbytes` integration

### General usage

You can quantize a model by using the `load_in_8bit` or `load_in_4bit` argument when calling the [`~PreTrainedModel.from_pretrained`] method as long as your model supports loading with 🤗 Accelerate and contains `torch.nn.Linear` layers. This should work for any modality as well.
```python
from transformers import AutoModelForCausalLM

model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True)
```
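
For a rough sense of what the two flags change, weight storage scales with the number of bits per parameter. The sketch below is only back-of-the-envelope arithmetic: `estimate_weight_bytes` is a hypothetical helper, the 350 million parameter count is a round figure assumed from the checkpoint name, and the estimate ignores modules kept in higher precision, quantization constants, and activations.

```python
def estimate_weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate bytes needed to store `num_params` weights at a given bit width."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight must be > 0")
    # Round up to a whole number of bytes.
    return (num_params * bits_per_weight + 7) // 8


# Assumed round figure, not the exact parameter count of facebook/opt-350m.
NUM_PARAMS = 350_000_000

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    mib = estimate_weight_bytes(NUM_PARAMS, bits) / 2**20
    print(f"{label:>8}: ~{mib:,.0f} MiB of weights")
```

Under these assumptions, 8-bit weights take half the space of `float16` weights and 4-bit weights take a quarter; the real footprint of a loaded model will be somewhat higher.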
By default all other modules (e.g. `torch.nn.LayerNorm`) will be converted in `torch.float16`, but if you want to change their `dtype` you can overwrite the `torch_dtype` argument:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM

>>> model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32)
>>> model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
torch.float32
```
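
When the bit width is chosen at runtime, the keyword arguments used above can be assembled in one place. `quantization_kwargs` below is a hypothetical helper, not part of `transformers`; it only builds the dictionary. Actually loading a model with it still needs `accelerate`, `bitsandbytes`, and a supported GPU.

```python
from typing import Any, Dict, Optional


def quantization_kwargs(bits: int, torch_dtype: Optional[Any] = None) -> Dict[str, Any]:
    """Build keyword arguments for `from_pretrained` (hypothetical helper).

    `bits` selects `load_in_8bit` or `load_in_4bit`; `torch_dtype`, if given,
    overrides the dtype used for the non-quantized modules.
    """
    if bits == 8:
        kwargs: Dict[str, Any] = {"load_in_8bit": True}
    elif bits == 4:
        kwargs = {"load_in_4bit": True}
    else:
        raise ValueError(f"bits must be 4 or 8, got {bits}")
    if torch_dtype is not None:
        kwargs["torch_dtype"] = torch_dtype
    return kwargs


print(quantization_kwargs(8))

# Intended use, left commented out because it downloads a checkpoint and needs a GPU:
# import torch
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "facebook/opt-350m", **quantization_kwargs(4, torch_dtype=torch.float32)
# )
```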

### FP4 quantization

#### Requirements

@@ -41,7 +64,7 @@ Make sure that you have installed the requirements below before running any of t
- Install latest `accelerate`
`pip install --upgrade accelerate`

-- Install latest `transformers` from source
+- Install latest `transformers`
`pip install --upgrade transformers`

#### Tips and best practices

@@ -2126,10 +2126,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
`True` when there is some disk offload.
load_in_8bit (`bool`, *optional*, defaults to `False`):
If `True`, will convert the loaded model into mixed-8bit quantized model. To use this feature please
-install `bitsandbytes` compiled with your CUDA version by running `pip install -i
-https://test.pypi.org/simple/ bitsandbytes-cudaXXX` where XXX is your CUDA version (e.g. 11.6 = 116).
-Make also sure that you have enough GPU RAM to store half of the model size since the 8bit modules are
-not compiled and adapted for CPUs.
+install `bitsandbytes` (`pip install -U bitsandbytes`).
+load_in_4bit (`bool`, *optional*, defaults to `False`):
+If `True`, will convert the loaded model into 4bit precision quantized model. To use this feature
+install the latest version of `bitsandbytes` (`pip install -U bitsandbytes`).
quantization_config (`Dict`, *optional*):
A dictionary of configuration parameters for the `bitsandbytes` library and loading the model using
advanced features such as offloading in fp32 on CPU or on disk.