diff --git a/docs/source/en/main_classes/quantization.md b/docs/source/en/main_classes/quantization.md
index 90d56969e4..e37f8cd998 100644
--- a/docs/source/en/main_classes/quantization.md
+++ b/docs/source/en/main_classes/quantization.md
@@ -29,6 +29,29 @@ If you want to quantize your own pytorch model, check out this [documentation](h
 
 Here are the things you can do using `bitsandbytes` integration
 
+### General usage
+
+You can quantize a model by using the `load_in_8bit` or `load_in_4bit` argument when calling the [`~PreTrainedModel.from_pretrained`] method as long as your model supports loading with 🤗 Accelerate and contains `torch.nn.Linear` layers. This should work for any modality as well.
+
+```python
+from transformers import AutoModelForCausalLM
+
+model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True)
+model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True)
+```
+
+By default all other modules (e.g. `torch.nn.LayerNorm`) will be converted in `torch.float16`, but if you want to change their `dtype` you can overwrite the `torch_dtype` argument:
+
+```python
+>>> import torch
+>>> from transformers import AutoModelForCausalLM
+
+>>> model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32)
+>>> model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
+torch.float32
+```
+
+
 ### FP4 quantization
 
 #### Requirements
@@ -41,7 +64,7 @@ Make sure that you have installed the requirements below before running any of t
 - Install latest `accelerate`
 `pip install --upgrade accelerate`
 
-- Install latest `transformers` from source
+- Install latest `transformers`
 `pip install --upgrade transformers`
 
 #### Tips and best practices
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 5ddea3e34e..7c39733be4 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -2126,10 +2126,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
                 `True` when there is some disk offload.
             load_in_8bit (`bool`, *optional*, defaults to `False`):
                 If `True`, will convert the loaded model into mixed-8bit quantized model. To use this feature please
-                install `bitsandbytes` compiled with your CUDA version by running `pip install -i
-                https://test.pypi.org/simple/ bitsandbytes-cudaXXX` where XXX is your CUDA version (e.g. 11.6 = 116).
-                Make also sure that you have enough GPU RAM to store half of the model size since the 8bit modules are
-                not compiled and adapted for CPUs.
+                install `bitsandbytes` (`pip install -U bitsandbytes`).
+            load_in_4bit (`bool`, *optional*, defaults to `False`):
+                If `True`, will convert the loaded model into 4bit precision quantized model. To use this feature
+                install the latest version of `bitsandbytes` (`pip install -U bitsandbytes`).
             quantization_config (`Dict`, *optional*):
                 A dictionary of configuration parameters for the `bitsandbytes` library and loading the model using
                 advanced features such as offloading in fp32 on CPU or on disk.