[bnb] Minor modifications (#18631)
* bnb minor modifications
  - refactor documentation
  - add troubleshooting README
  - add PyPi library on DockerFile
* Apply suggestions from code review
  Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review
* put in one block
  - put bash instructions in one block
* update readme
  - refactor a bit hardware requirements
* change text a bit
* Apply suggestions from code review
  Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* apply suggestions
  Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* add link to paper
* Apply suggestions from code review
  Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update tests/mixed_int8/README.md
* Apply suggestions from code review
* refactor a bit
* add instructions Turing & Amperer
  Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* add A6000
* clarify a bit
* remove small part
* Update tests/mixed_int8/README.md

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
@@ -133,46 +133,6 @@ model = AutoModel.from_config(config)
Due to PyTorch's design, this functionality is only available for floating dtypes.

### `bitsandbytes` integration for Int8 mixed-precision matrix decomposition

Based on the paper `LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale`, we support a Hugging Face 🤗 integration for all models on the Hub with a few lines of code.
For models trained in half precision (either `float16` or `bfloat16`) or full precision, this method aims to reduce the size of `nn.Linear` layers by a factor of 2 (if trained in half precision) or 4 (if trained in full precision), without affecting quality too much, by operating on the outliers in half precision.
This technique works well for billion-scale models (>1B parameters), so we advise you to use it only for models of that scale. It has been tested on models from 2 billion to 176 billion parameters and supports only PyTorch models.



Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream, matrix-multiplied in fp16 (0.1%), and (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models (>=176B parameters).
Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
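
The two-stream decomposition described above can be illustrated with a small NumPy sketch. This is not the `bitsandbytes` implementation, which runs in custom GPU kernels; the function name `mixed_int8_matmul`, the row-wise/column-wise absmax scaling, and the fp16 rounding step are assumptions made for illustration only.

```python
import numpy as np


def mixed_int8_matmul(x, w, threshold=6.0):
    """Approximate x @ w with an fp16 outlier stream and an int8 regular stream.

    Hypothetical helper for illustration; not part of bitsandbytes.
    """
    # Feature dimensions (columns of x) that contain at least one outlier.
    outlier_cols = (np.abs(x) >= threshold).any(axis=0)
    regular_cols = ~outlier_cols

    # Stream 1: outlier features and the matching rows of w are rounded to
    # fp16 precision, then multiplied with fp32 accumulation.
    x_out = x[:, outlier_cols].astype(np.float16).astype(np.float32)
    w_out = w[outlier_cols, :].astype(np.float16).astype(np.float32)
    out_fp16 = x_out @ w_out

    # Stream 2: the remaining features are quantized to int8 with absmax
    # scaling, one scale per row of x and one per column of w.
    x_reg = x[:, regular_cols]
    w_reg = w[regular_cols, :]
    x_absmax = np.abs(x_reg).max(axis=1, keepdims=True, initial=0.0)
    w_absmax = np.abs(w_reg).max(axis=0, keepdims=True, initial=0.0)
    x_scale = np.maximum(x_absmax, 1e-8) / 127.0
    w_scale = np.maximum(w_absmax, 1e-8) / 127.0
    x_q = np.round(x_reg / x_scale).astype(np.int8)
    w_q = np.round(w_reg / w_scale).astype(np.int8)

    # Accumulate in int32, then dequantize with the product of the scales.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    out_int8 = acc.astype(np.float32) * (x_scale * w_scale)

    # The final result is the sum of both streams.
    return out_int8 + out_fp16
```

Lowering `threshold` moves more feature dimensions into the fp16 stream, which trades memory savings for precision.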

Note also that you need a GPU to run mixed 8-bit models, as the kernels have been compiled for GPUs only. Before using this feature, make sure you have enough GPU RAM to store a quarter of the model (or half, if your model is natively in half precision).
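
As a rough back-of-the-envelope check of the memory requirement above, the weight footprint can be estimated from the parameter count. The helper below is a hypothetical sketch: it assumes one byte per parameter after conversion and ignores activations, any layers kept in higher precision, and framework overhead.

```python
def estimate_weight_memory_gb(num_params, native_bytes_per_param):
    """Return (native_gb, int8_gb) for the model weights alone.

    Hypothetical helper: native_bytes_per_param is 4 for full precision
    and 2 for half precision; int8 weights are assumed to take 1 byte each.
    """
    native_gb = num_params * native_bytes_per_param / 1e9
    int8_gb = num_params * 1 / 1e9
    return native_gb, int8_gb


# A 3-billion-parameter model stored natively in half precision.
native_gb, int8_gb = estimate_weight_memory_gb(3_000_000_000, 2)
print(f"native: {native_gb} GB, int8: {int8_gb} GB")
```

For a half-precision model the int8 estimate is half the native size, and for a full-precision model it is a quarter, matching the ratios stated above.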

Below are some notes to help you use this module. You can also follow this demo on Google Colab: [](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)

#### Requirements

- Make sure you run this on an NVIDIA GPU that supports 8-bit tensor cores (Turing or Ampere GPUs, e.g. T4, RTX 20 series, RTX 30 series, A40-A100). Note that previous generations of NVIDIA GPUs do not support 8-bit tensor cores.
- Install the correct version of `bitsandbytes` by running:
`pip install -i https://test.pypi.org/simple/ bitsandbytes`
- Install `accelerate`:
`pip install accelerate`
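
The hardware requirement above can be checked programmatically from a GPU's CUDA compute capability. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and it treats compute capability 7.5 (Turing, e.g. T4) as the minimum and assumes every later architecture also qualifies.

```python
def supports_int8_tensor_cores(capability):
    """Return True if a (major, minor) compute capability is 7.5 or newer.

    Hypothetical helper: 7.5 corresponds to Turing and 8.x to Ampere;
    treating everything above 7.5 as supported is an assumption.
    """
    major, minor = capability
    return (major, minor) >= (7, 5)


print(supports_int8_tensor_cores((7, 5)))  # T4 (Turing)
print(supports_int8_tensor_cores((8, 0)))  # A100 (Ampere)
print(supports_int8_tensor_cores((7, 0)))  # V100 (Volta, previous generation)
```

With PyTorch installed, the tuple for the current device can be obtained from `torch.cuda.get_device_capability()` and passed to this helper.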

#### Running mixed-int8 models

After installing the required libraries, load your mixed 8-bit model as follows:
```py
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
# device_map="auto" lets accelerate place the weights; load_in_8bit=True enables the int8 conversion.
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
The implementation supports multi-GPU setups thanks to `accelerate` as the backend. To control how much GPU memory is allocated on each GPU, use the `max_memory` argument.
For example, to allocate `1GB` on GPU 0 and `2GB` on GPU 1, pass `max_memory={0: "1GB", 1: "2GB"}`:
```py
from transformers import AutoModelForCausalLM

# Maps each GPU index to the maximum memory to allocate on that GPU.
max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```
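
When there are many GPUs, the `max_memory` mapping shown above can be built from a list of per-GPU budgets instead of being written by hand. The helper below is a hypothetical convenience, not part of `transformers` or `accelerate`; it only produces a dictionary in the same `{index: "<n>GB"}` form used in the example.

```python
def build_max_memory(budgets_gb):
    """Map GPU index i to the string "<budget>GB" for the i-th budget.

    Hypothetical helper for illustration only.
    """
    return {gpu_index: f"{budget}GB" for gpu_index, budget in enumerate(budgets_gb)}


# Reproduces the mapping from the example: 1GB on GPU 0 and 2GB on GPU 1.
max_memory_mapping = build_max_memory([1, 2])
print(max_memory_mapping)
```

The resulting dictionary can then be passed as `max_memory=max_memory_mapping` to `from_pretrained`.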
## ModuleUtilsMixin