Enable BNB multi-backend support (#31098)

* enable cpu bnb path

* fix style

* fix code style

* fix 4 bit path

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* add multi backend refactor tests

* fix style

* tweak 4bit quantizer + fix corresponding tests

* tweak 8bit quantizer + *try* fixing corresponding tests

* fix dequant bnb 8bit

* account for Intel CPU in variability of expected outputs

* enable cpu and xpu device map

* further tweaks to account for Intel CPU

* fix autocast to work with both cpu + cuda

* fix comments

* fix comments

* switch to testing_utils.torch_device

* allow for xpu in multi-gpu tests

* fix tests 4bit for CPU NF4

* fix bug with is_torch_xpu_available needing to be called as func

* avoid issue where test reports attr err due to other failure

* fix formatting

* fix typo from resolving of merge conflict

* polish based on last PR review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* fix CI

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix error log

* fix error msg

* add \n in error log

* make quality

* rm bnb cuda restriction in doc

* cpu model don't need dispatch

* fix doc

* fix style

* check cuda avaliable in testing

* fix tests

* Update docs/source/en/model_doc/chameleon.md

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix doc

* fix check multibackends

* fix import sort

* remove check torch in bnb

* docs: update bitsandbytes references with multi-backend info

* docs: fix small mistakes in bnb paragraph

* run formatting

* reveret bnb check

* move bnb multi-backend check to import_utils

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* fix bnb check

* minor fix for bnb

* check lib first

* fix code style

* Revert "run formatting"

This reverts commit ac108c6d6b34f45a5745a736ba57282405cfaa61.

* fix format

* give warning when bnb version is low and no cuda found]

* fix device assignment check to be multi-device capable

* address akx feedback on get_avlbl_dev fn

* revert partially, as we don't want the function that public, as docs would be too much (enforced)

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
This commit is contained in:
jiqing-feng
2024-09-24 17:40:56 +08:00
committed by GitHub
parent e15687fffe
commit 11c27dd331
20 changed files with 436 additions and 100 deletions

View File

@@ -181,7 +181,7 @@ for every matrix multiplication. Dequantization and re-quantization is performed
Therefore, inference time is often **not** reduced when using quantized weights, but rather increases. Therefore, inference time is often **not** reduced when using quantized weights, but rather increases.
Enough theory, let's give it a try! To quantize the weights with Transformers, you need to make sure that Enough theory, let's give it a try! To quantize the weights with Transformers, you need to make sure that
the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) library is installed. the [`bitsandbytes`](https://github.com/bitsandbytes-foundation/bitsandbytes) library is installed.
```bash ```bash
!pip install bitsandbytes !pip install bitsandbytes

View File

@@ -128,7 +128,17 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
### Quantization using Bitsandbytes ### Quantization using Bitsandbytes
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with: The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library.
<Tip>
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
</Tip>
Simply change the snippet above with:
```python ```python
from transformers import ChameleonForConditionalGeneration, BitsAndBytesConfig from transformers import ChameleonForConditionalGeneration, BitsAndBytesConfig

View File

@@ -233,7 +233,17 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
### Quantization using Bitsandbytes ### Quantization using Bitsandbytes
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with: The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes`, and to have access to a GPU/accelerator that is supported by the library.
<Tip>
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
</Tip>
Simply change the snippet above with:
```python ```python
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig

View File

@@ -205,7 +205,17 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. This allows for efficient deployment on resource-constrained cases. The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. This allows for efficient deployment on resource-constrained cases.
First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a CUDA compatible GPU device. Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below: First, make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library.
<Tip>
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
</Tip>
Then simply load the quantized model by adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:
```python ```python

View File

@@ -264,9 +264,19 @@ processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spac
## Model optimization ## Model optimization
### Quantization using Bitsandbytes ### Quantization using bitsandbytes
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with: The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a GPU/accelerator that is supported by the library.
<Tip>
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
</Tip>
Simply change the snippet above with:
```python ```python
from transformers import LlavaOnevisionForConditionalGeneration, BitsAndBytesConfig from transformers import LlavaOnevisionForConditionalGeneration, BitsAndBytesConfig

View File

@@ -141,7 +141,7 @@ The Flash Attention-2 model uses also a more memory efficient cache slicing mech
As the Mixtral model has 45 billion parameters, that would require about 90GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), a single A100 with 40GB of RAM is enough to fit the entire model, as in that case only about 27 GB of RAM is required. As the Mixtral model has 45 billion parameters, that would require about 90GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), a single A100 with 40GB of RAM is enough to fit the entire model, as in that case only about 27 GB of RAM is required.
Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the BitsAndyBytes quantization (but refer to [this page](../quantization.md) for other quantization methods): Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization.md) for alternative quantization methods):
```python ```python
>>> import torch >>> import torch

View File

@@ -139,7 +139,17 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. his allows for efficient deployment on resource-constrained cases. The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. his allows for efficient deployment on resource-constrained cases.
First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a CUDA compatible GPU device. Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below: First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library.
<Tip>
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
</Tip>
Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:
```python ```python

View File

@@ -233,7 +233,7 @@ Let's look at the details.
**Optimizer States:** **Optimizer States:**
- 8 bytes * number of parameters for normal AdamW (maintains 2 states) - 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) - 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes)
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state) - 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)
**Gradients** **Gradients**

View File

@@ -284,7 +284,7 @@ training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bn
However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated. However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated.
First, follow the installation guide in the GitHub [repo](https://github.com/TimDettmers/bitsandbytes) to install the `bitsandbytes` library First, follow the installation guide in the GitHub [repo](https://github.com/bitsandbytes-foundation/bitsandbytes) to install the `bitsandbytes` library
that implements the 8-bit Adam optimizer. that implements the 8-bit Adam optimizer.
Next you need to initialize the optimizer. This involves two steps: Next you need to initialize the optimizer. This involves two steps:

View File

@@ -38,6 +38,14 @@ pip install --upgrade accelerate transformers
</hfoption> </hfoption>
</hfoptions> </hfoptions>
<Tip>
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
</Tip>
Now you can quantize a model by passing a `BitsAndBytesConfig` to [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers. Now you can quantize a model by passing a `BitsAndBytesConfig` to [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers.
<hfoptions id="bnb"> <hfoptions id="bnb">

View File

@@ -49,7 +49,7 @@ Use the table below to help you decide which quantization method to use.
|-------------------------------------|-------------------------|-----|----------|----------------|-----------------------|-------------------------|----------------|-------------------------------------|--------------|------------------------|---------------------------------------------| |-------------------------------------|-------------------------|-----|----------|----------------|-----------------------|-------------------------|----------------|-------------------------------------|--------------|------------------------|---------------------------------------------|
| [AQLM](./aqlm) | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 1 / 2 | 🟢 | 🟢 | 🟢 | https://github.com/Vahe1994/AQLM | | [AQLM](./aqlm) | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 1 / 2 | 🟢 | 🟢 | 🟢 | https://github.com/Vahe1994/AQLM |
| [AWQ](./awq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | ? | 4 | 🟢 | 🟢 | 🟢 | https://github.com/casper-hansen/AutoAWQ | | [AWQ](./awq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | ? | 4 | 🟢 | 🟢 | 🟢 | https://github.com/casper-hansen/AutoAWQ |
| [bitsandbytes](./bitsandbytes) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 4 / 8 | 🟢 | 🟢 | 🟢 | https://github.com/TimDettmers/bitsandbytes | | [bitsandbytes](./bitsandbytes) | 🟢 | 🟡 * | 🟢 | 🟡 * | 🔴 ** | 🔴 (soon!) | 4 / 8 | 🟢 | 🟢 | 🟢 | https://github.com/bitsandbytes-foundation/bitsandbytes |
| [EETQ](./eetq) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | ? | 8 | 🟢 | 🟢 | 🟢 | https://github.com/NetEase-FuXi/EETQ | | [EETQ](./eetq) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | ? | 8 | 🟢 | 🟢 | 🟢 | https://github.com/NetEase-FuXi/EETQ |
| GGUF / GGML (llama.cpp) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 1 - 8 | 🔴 | [See GGUF section](../gguf) | [See GGUF section](../gguf) | https://github.com/ggerganov/llama.cpp | | GGUF / GGML (llama.cpp) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 1 - 8 | 🔴 | [See GGUF section](../gguf) | [See GGUF section](../gguf) | https://github.com/ggerganov/llama.cpp |
| [GPTQ](./gptq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 2 - 3 - 4 - 8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ | | [GPTQ](./gptq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 2 - 3 - 4 - 8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ |
@@ -57,3 +57,17 @@ Use the table below to help you decide which quantization method to use.
| [Quanto](./quanto) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 2 / 4 / 8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/quanto | | [Quanto](./quanto) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 2 / 4 / 8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/quanto |
| [FBGEMM_FP8](./fbgemm_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM | | [FBGEMM_FP8](./fbgemm_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
| [torchao](./torchao.md) | 🟢 | | 🟢 | 🔴 | partial support (int4 weight only) | | 4 / 8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao | | [torchao](./torchao.md) | 🟢 | | 🟢 | 🔴 | partial support (int4 weight only) | | 4 / 8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
<Tip>
\* bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
</Tip>
<Tip>
\** bitsandbytes is seeking contributors to help develop and lead the Apple Silicon backend. Interested? Contact them directly via their repo. Stipends may be available through sponsorships.
</Tip>

View File

@@ -31,6 +31,7 @@ _import_structure = {
"replace_with_bnb_linear", "replace_with_bnb_linear",
"set_module_8bit_tensor_to_device", "set_module_8bit_tensor_to_device",
"set_module_quantized_tensor_to_device", "set_module_quantized_tensor_to_device",
"validate_bnb_backend_availability",
], ],
"deepspeed": [ "deepspeed": [
"HfDeepSpeedConfig", "HfDeepSpeedConfig",
@@ -124,6 +125,7 @@ if TYPE_CHECKING:
replace_with_bnb_linear, replace_with_bnb_linear,
set_module_8bit_tensor_to_device, set_module_8bit_tensor_to_device,
set_module_quantized_tensor_to_device, set_module_quantized_tensor_to_device,
validate_bnb_backend_availability,
) )
from .deepspeed import ( from .deepspeed import (
HfDeepSpeedConfig, HfDeepSpeedConfig,

View File

@@ -6,7 +6,15 @@ from inspect import signature
from packaging import version from packaging import version
from ..utils import is_accelerate_available, is_bitsandbytes_available, logging from ..utils import (
get_available_devices,
is_accelerate_available,
is_bitsandbytes_available,
is_bitsandbytes_multi_backend_available,
is_ipex_available,
is_torch_available,
logging,
)
if is_bitsandbytes_available(): if is_bitsandbytes_available():
@@ -332,7 +340,7 @@ def get_keys_to_not_convert(model):
# Copied from PEFT: https://github.com/huggingface/peft/blob/47b3712898539569c02ec5b3ed4a6c36811331a1/src/peft/utils/integrations.py#L41 # Copied from PEFT: https://github.com/huggingface/peft/blob/47b3712898539569c02ec5b3ed4a6c36811331a1/src/peft/utils/integrations.py#L41
def dequantize_bnb_weight(weight: "torch.nn.Parameter", state=None): def dequantize_bnb_weight(weight: "torch.nn.Parameter", dtype: "torch.dtype", state=None):
""" """
Helper function to dequantize 4bit or 8bit bnb weights. Helper function to dequantize 4bit or 8bit bnb weights.
@@ -350,7 +358,7 @@ def dequantize_bnb_weight(weight: "torch.nn.Parameter", state=None):
logger.warning_once( logger.warning_once(
f"The model is going to be dequantized in {output_tensor.dtype} - if you want to upcast it to another dtype, make sure to pass the desired dtype when quantizing the model through `bnb_4bit_quant_type` argument of `BitsAndBytesConfig`" f"The model is going to be dequantized in {output_tensor.dtype} - if you want to upcast it to another dtype, make sure to pass the desired dtype when quantizing the model through `bnb_4bit_quant_type` argument of `BitsAndBytesConfig`"
) )
return output_tensor return output_tensor.to(dtype)
if state.SCB is None: if state.SCB is None:
state.SCB = weight.SCB state.SCB = weight.SCB
@@ -361,7 +369,7 @@ def dequantize_bnb_weight(weight: "torch.nn.Parameter", state=None):
if state.CxB is None: if state.CxB is None:
state.CxB, state.SB = bnb.functional.transform(weight.data, to_order=state.formatB) state.CxB, state.SB = bnb.functional.transform(weight.data, to_order=state.formatB)
out32, Sout32 = bnb.functional.igemmlt(im, state.CxB, Sim, state.SB) out32, Sout32 = bnb.functional.igemmlt(im, state.CxB, Sim, state.SB)
return bnb.functional.mm_dequant(out32, Sout32, SCim, state.SCB, bias=None).t() return bnb.functional.mm_dequant(out32, Sout32, SCim, state.SCB, bias=None).t().to(dtype)
def _create_accelerate_new_hook(old_hook): def _create_accelerate_new_hook(old_hook):
@@ -383,6 +391,7 @@ def _create_accelerate_new_hook(old_hook):
def _dequantize_and_replace( def _dequantize_and_replace(
model, model,
dtype,
modules_to_not_convert=None, modules_to_not_convert=None,
current_key_name=None, current_key_name=None,
quantization_config=None, quantization_config=None,
@@ -422,7 +431,7 @@ def _dequantize_and_replace(
else: else:
state = None state = None
new_module.weight = torch.nn.Parameter(dequantize_bnb_weight(module.weight, state)) new_module.weight = torch.nn.Parameter(dequantize_bnb_weight(module.weight, dtype, state))
if bias is not None: if bias is not None:
new_module.bias = bias new_module.bias = bias
@@ -441,6 +450,7 @@ def _dequantize_and_replace(
if len(list(module.children())) > 0: if len(list(module.children())) > 0:
_, has_been_replaced = _dequantize_and_replace( _, has_been_replaced = _dequantize_and_replace(
module, module,
dtype,
modules_to_not_convert, modules_to_not_convert,
current_key_name, current_key_name,
quantization_config, quantization_config,
@@ -458,6 +468,7 @@ def dequantize_and_replace(
): ):
model, has_been_replaced = _dequantize_and_replace( model, has_been_replaced = _dequantize_and_replace(
model, model,
model.dtype,
modules_to_not_convert=modules_to_not_convert, modules_to_not_convert=modules_to_not_convert,
quantization_config=quantization_config, quantization_config=quantization_config,
) )
@@ -468,3 +479,80 @@ def dequantize_and_replace(
) )
return model return model
def _validate_bnb_multi_backend_availability(raise_exception):
import bitsandbytes as bnb
bnb_supported_devices = getattr(bnb, "supported_torch_devices", set())
available_devices = get_available_devices()
if available_devices == {"cpu"} and not is_ipex_available():
from importlib.util import find_spec
if find_spec("intel_extension_for_pytorch"):
logger.warning(
"You have Intel IPEX installed but if you're intending to use it for CPU, it might not have the right version. Be sure to double check that your PyTorch and IPEX installs are compatible."
)
available_devices.discard("cpu") # Only Intel CPU is supported by BNB at the moment
if not available_devices.intersection(bnb_supported_devices):
if raise_exception:
bnb_supported_devices_with_info = set( # noqa: C401
'"cpu" (needs an Intel CPU and intel_extension_for_pytorch installed and compatible with the PyTorch version)'
if device == "cpu"
else device
for device in bnb_supported_devices
)
err_msg = (
f"None of the available devices `available_devices = {available_devices or None}` are supported by the bitsandbytes version you have installed: `bnb_supported_devices = {bnb_supported_devices_with_info}`. "
"Please check the docs to see if the backend you intend to use is available and how to install it: https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend"
)
logger.error(err_msg)
raise RuntimeError(err_msg)
logger.warning("No supported devices found for bitsandbytes multi-backend.")
return False
logger.debug("Multi-backend validation successful.")
return True
def _validate_bnb_cuda_backend_availability(raise_exception):
if not is_torch_available():
return False
import torch
if not torch.cuda.is_available():
log_msg = (
"CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. "
"Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend"
)
if raise_exception:
logger.error(log_msg)
raise RuntimeError(log_msg)
logger.warning(log_msg)
return False
logger.debug("CUDA backend validation successful.")
return True
def validate_bnb_backend_availability(raise_exception=False):
"""
Validates if the available devices are supported by bitsandbytes, optionally raising an exception if not.
"""
if not is_bitsandbytes_available():
if importlib.util.find_spec("bitsandbytes") and version.parse(
importlib.metadata.version("bitsandbytes")
) < version.parse("0.43.1"):
return _validate_bnb_cuda_backend_availability(raise_exception)
return False
if is_bitsandbytes_multi_backend_available():
return _validate_bnb_multi_backend_availability(raise_exception)
return _validate_bnb_cuda_backend_availability(raise_exception)

View File

@@ -29,6 +29,7 @@ from ..utils import (
is_accelerate_available, is_accelerate_available,
is_bitsandbytes_available, is_bitsandbytes_available,
is_torch_available, is_torch_available,
is_torch_xpu_available,
logging, logging,
) )
@@ -65,8 +66,6 @@ class Bnb4BitHfQuantizer(HfQuantizer):
self.modules_to_not_convert = self.quantization_config.llm_int8_skip_modules self.modules_to_not_convert = self.quantization_config.llm_int8_skip_modules
def validate_environment(self, *args, **kwargs): def validate_environment(self, *args, **kwargs):
if not torch.cuda.is_available():
raise RuntimeError("No GPU found. A GPU is needed for quantization.")
if not is_accelerate_available(): if not is_accelerate_available():
raise ImportError( raise ImportError(
f"Using `bitsandbytes` 4-bit quantization requires Accelerate: `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`" f"Using `bitsandbytes` 4-bit quantization requires Accelerate: `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`"
@@ -76,6 +75,12 @@ class Bnb4BitHfQuantizer(HfQuantizer):
"Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`" "Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`"
) )
from ..integrations import validate_bnb_backend_availability
from ..utils import is_bitsandbytes_multi_backend_available
bnb_multibackend_is_enabled = is_bitsandbytes_multi_backend_available()
validate_bnb_backend_availability(raise_exception=True)
if kwargs.get("from_tf", False) or kwargs.get("from_flax", False): if kwargs.get("from_tf", False) or kwargs.get("from_flax", False):
raise ValueError( raise ValueError(
"Converting into 4-bit or 8-bit weights from tf/flax weights is currently not supported, please make" "Converting into 4-bit or 8-bit weights from tf/flax weights is currently not supported, please make"
@@ -91,7 +96,9 @@ class Bnb4BitHfQuantizer(HfQuantizer):
device_map_without_lm_head = { device_map_without_lm_head = {
key: device_map[key] for key in device_map.keys() if key not in self.modules_to_not_convert key: device_map[key] for key in device_map.keys() if key not in self.modules_to_not_convert
} }
if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values(): if set(device_map.values()) == {"cpu"} and bnb_multibackend_is_enabled:
pass
elif "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
raise ValueError( raise ValueError(
"Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the " "Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the "
"quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules " "quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules "
@@ -255,10 +262,15 @@ class Bnb4BitHfQuantizer(HfQuantizer):
# Copied from transformers.quantizers.quantizer_bnb_8bit.Bnb8BitHfQuantizer.update_device_map # Copied from transformers.quantizers.quantizer_bnb_8bit.Bnb8BitHfQuantizer.update_device_map
def update_device_map(self, device_map): def update_device_map(self, device_map):
if device_map is None: if device_map is None:
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available():
device_map = {"": torch.cuda.current_device()}
elif is_torch_xpu_available():
device_map = {"": f"xpu:{torch.xpu.current_device()}"}
else:
device_map = {"": "cpu"}
logger.info( logger.info(
"The device_map was not initialized. " "The device_map was not initialized. "
"Setting device_map to {'':torch.cuda.current_device()}. " f"Setting device_map to {device_map}. "
"If you want to use the model for inference, please set device_map ='auto' " "If you want to use the model for inference, please set device_map ='auto' "
) )
return device_map return device_map

View File

@@ -27,6 +27,7 @@ from ..utils import (
is_accelerate_available, is_accelerate_available,
is_bitsandbytes_available, is_bitsandbytes_available,
is_torch_available, is_torch_available,
is_torch_xpu_available,
logging, logging,
) )
from .quantizers_utils import get_module_from_name from .quantizers_utils import get_module_from_name
@@ -64,9 +65,6 @@ class Bnb8BitHfQuantizer(HfQuantizer):
self.modules_to_not_convert = self.quantization_config.llm_int8_skip_modules self.modules_to_not_convert = self.quantization_config.llm_int8_skip_modules
def validate_environment(self, *args, **kwargs): def validate_environment(self, *args, **kwargs):
if not torch.cuda.is_available():
raise RuntimeError("No GPU found. A GPU is needed for quantization.")
if not is_accelerate_available(): if not is_accelerate_available():
raise ImportError( raise ImportError(
f"Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`" f"Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`"
@@ -76,6 +74,12 @@ class Bnb8BitHfQuantizer(HfQuantizer):
"Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`" "Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`"
) )
from ..integrations import validate_bnb_backend_availability
from ..utils import is_bitsandbytes_multi_backend_available
bnb_multibackend_is_enabled = is_bitsandbytes_multi_backend_available()
validate_bnb_backend_availability(raise_exception=True)
if kwargs.get("from_tf", False) or kwargs.get("from_flax", False): if kwargs.get("from_tf", False) or kwargs.get("from_flax", False):
raise ValueError( raise ValueError(
"Converting into 4-bit or 8-bit weights from tf/flax weights is currently not supported, please make" "Converting into 4-bit or 8-bit weights from tf/flax weights is currently not supported, please make"
@@ -91,7 +95,9 @@ class Bnb8BitHfQuantizer(HfQuantizer):
device_map_without_lm_head = { device_map_without_lm_head = {
key: device_map[key] for key in device_map.keys() if key not in self.modules_to_not_convert key: device_map[key] for key in device_map.keys() if key not in self.modules_to_not_convert
} }
if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values(): if set(device_map.values()) == {"cpu"} and bnb_multibackend_is_enabled:
pass
elif "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
raise ValueError( raise ValueError(
"Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the " "Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the "
"quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules " "quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules "
@@ -127,10 +133,15 @@ class Bnb8BitHfQuantizer(HfQuantizer):
def update_device_map(self, device_map): def update_device_map(self, device_map):
if device_map is None: if device_map is None:
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available():
device_map = {"": torch.cuda.current_device()}
elif is_torch_xpu_available():
device_map = {"": f"xpu:{torch.xpu.current_device()}"}
else:
device_map = {"": "cpu"}
logger.info( logger.info(
"The device_map was not initialized. " "The device_map was not initialized. "
"Setting device_map to {'':torch.cuda.current_device()}. " f"Setting device_map to {device_map}. "
"If you want to use the model for inference, please set device_map ='auto' " "If you want to use the model for inference, please set device_map ='auto' "
) )
return device_map return device_map

View File

@@ -61,6 +61,7 @@ from .utils import (
is_auto_gptq_available, is_auto_gptq_available,
is_av_available, is_av_available,
is_bitsandbytes_available, is_bitsandbytes_available,
is_bitsandbytes_multi_backend_available,
is_bs4_available, is_bs4_available,
is_cv2_available, is_cv2_available,
is_cython_available, is_cython_available,
@@ -224,6 +225,17 @@ _run_agent_tests = parse_flag_from_env("RUN_AGENT_TESTS", default=False)
_run_third_party_device_tests = parse_flag_from_env("RUN_THIRD_PARTY_DEVICE_TESTS", default=False) _run_third_party_device_tests = parse_flag_from_env("RUN_THIRD_PARTY_DEVICE_TESTS", default=False)
def get_device_count():
import torch
if is_torch_xpu_available():
num_devices = torch.xpu.device_count()
else:
num_devices = torch.cuda.device_count()
return num_devices
def is_pt_tf_cross_test(test_case): def is_pt_tf_cross_test(test_case):
""" """
Decorator marking a test as a test that control interactions between PyTorch and TensorFlow. Decorator marking a test as a test that control interactions between PyTorch and TensorFlow.
@@ -331,6 +343,29 @@ def tooslow(test_case):
return unittest.skip(reason="test is too slow")(test_case) return unittest.skip(reason="test is too slow")(test_case)
def skip_if_not_implemented(test_func):
@functools.wraps(test_func)
def wrapper(*args, **kwargs):
try:
return test_func(*args, **kwargs)
except NotImplementedError as e:
raise unittest.SkipTest(f"Test skipped due to NotImplementedError: {e}")
return wrapper
def apply_skip_if_not_implemented(cls):
"""
Class decorator to apply @skip_if_not_implemented to all test methods.
"""
for attr_name in dir(cls):
if attr_name.startswith("test_"):
attr = getattr(cls, attr_name)
if callable(attr):
setattr(cls, attr_name, skip_if_not_implemented(attr))
return cls
def custom_tokenizers(test_case): def custom_tokenizers(test_case):
""" """
Decorator marking a test for a custom tokenizer. Decorator marking a test for a custom tokenizer.
@@ -738,9 +773,9 @@ def require_torch_multi_gpu(test_case):
if not is_torch_available(): if not is_torch_available():
return unittest.skip(reason="test requires PyTorch")(test_case) return unittest.skip(reason="test requires PyTorch")(test_case)
import torch device_count = get_device_count()
return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple GPUs")(test_case) return unittest.skipUnless(device_count > 1, "test requires multiple GPUs")(test_case)
def require_torch_multi_accelerator(test_case): def require_torch_multi_accelerator(test_case):
@@ -947,6 +982,15 @@ def require_torch_gpu(test_case):
return unittest.skipUnless(torch_device == "cuda", "test requires CUDA")(test_case) return unittest.skipUnless(torch_device == "cuda", "test requires CUDA")(test_case)
def require_torch_gpu_if_bnb_not_multi_backend_enabled(test_case):
"""
Decorator marking a test that requires a GPU if bitsandbytes multi-backend feature is not enabled.
"""
if is_bitsandbytes_available() and is_bitsandbytes_multi_backend_available():
return test_case
return require_torch_gpu(test_case)
def require_torch_accelerator(test_case): def require_torch_accelerator(test_case):
"""Decorator marking a test that requires an accessible accelerator and PyTorch.""" """Decorator marking a test that requires an accessible accelerator and PyTorch."""
return unittest.skipUnless(torch_device is not None and torch_device != "cpu", "test requires accelerator")( return unittest.skipUnless(torch_device is not None and torch_device != "cpu", "test requires accelerator")(

View File

@@ -15,6 +15,9 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from functools import lru_cache
from typing import FrozenSet
from huggingface_hub import get_full_repo_name # for backward compatibility from huggingface_hub import get_full_repo_name # for backward compatibility
from huggingface_hub.constants import HF_HUB_DISABLE_TELEMETRY as DISABLE_TELEMETRY # for backward compatibility from huggingface_hub.constants import HF_HUB_DISABLE_TELEMETRY as DISABLE_TELEMETRY # for backward compatibility
from packaging import version from packaging import version
@@ -118,6 +121,7 @@ from .import_utils import (
is_auto_gptq_available, is_auto_gptq_available,
is_av_available, is_av_available,
is_bitsandbytes_available, is_bitsandbytes_available,
is_bitsandbytes_multi_backend_available,
is_bs4_available, is_bs4_available,
is_coloredlogs_available, is_coloredlogs_available,
is_cv2_available, is_cv2_available,
@@ -277,3 +281,31 @@ def check_min_version(min_version):
+ "Check out https://github.com/huggingface/transformers/tree/main/examples#important-note for the examples corresponding to other " + "Check out https://github.com/huggingface/transformers/tree/main/examples#important-note for the examples corresponding to other "
"versions of HuggingFace Transformers." "versions of HuggingFace Transformers."
) )
@lru_cache()
def get_available_devices() -> FrozenSet[str]:
"""
Returns a frozenset of devices available for the current PyTorch installation.
"""
devices = {"cpu"} # `cpu` is always supported as a device in PyTorch
if is_torch_cuda_available():
devices.add("cuda")
if is_torch_mps_available():
devices.add("mps")
if is_torch_xpu_available():
devices.add("xpu")
if is_torch_npu_available():
devices.add("npu")
if is_torch_mlu_available():
devices.add("mlu")
if is_torch_musa_available():
devices.add("musa")
return frozenset(devices)

View File

@@ -849,15 +849,29 @@ def is_torch_xpu_available(check_device=False):
return hasattr(torch, "xpu") and torch.xpu.is_available() return hasattr(torch, "xpu") and torch.xpu.is_available()
@lru_cache()
def is_bitsandbytes_available(): def is_bitsandbytes_available():
if not is_torch_available(): if not is_torch_available() or not _bitsandbytes_available:
return False return False
# bitsandbytes throws an error if cuda is not available
# let's avoid that by adding a simple check
import torch import torch
return _bitsandbytes_available and torch.cuda.is_available() # `bitsandbytes` versions older than 0.43.1 eagerly require CUDA at import time,
# so those versions of the library are practically only available when CUDA is too.
if version.parse(importlib.metadata.version("bitsandbytes")) < version.parse("0.43.1"):
return torch.cuda.is_available()
# Newer versions of `bitsandbytes` can be imported on systems without CUDA.
return True
def is_bitsandbytes_multi_backend_available() -> bool:
if not is_bitsandbytes_available():
return False
import bitsandbytes as bnb
return "multi_backend" in getattr(bnb, "features", set())
def is_flash_attn_2_available(): def is_flash_attn_2_available():

View File

@@ -30,12 +30,13 @@ from transformers import (
pipeline, pipeline,
) )
from transformers.testing_utils import ( from transformers.testing_utils import (
apply_skip_if_not_implemented,
is_bitsandbytes_available, is_bitsandbytes_available,
is_torch_available, is_torch_available,
require_accelerate, require_accelerate,
require_bitsandbytes, require_bitsandbytes,
require_torch, require_torch,
require_torch_gpu, require_torch_gpu_if_bnb_not_multi_backend_enabled,
require_torch_multi_gpu, require_torch_multi_gpu,
slow, slow,
torch_device, torch_device,
@@ -85,7 +86,7 @@ if is_bitsandbytes_available():
@require_bitsandbytes @require_bitsandbytes
@require_accelerate @require_accelerate
@require_torch @require_torch
@require_torch_gpu @require_torch_gpu_if_bnb_not_multi_backend_enabled
@slow @slow
class Base4bitTest(unittest.TestCase): class Base4bitTest(unittest.TestCase):
# We keep the constants inside the init function and model loading inside setUp function # We keep the constants inside the init function and model loading inside setUp function
@@ -111,6 +112,7 @@ class Base4bitTest(unittest.TestCase):
self.tokenizer = AutoTokenizer.from_pretrained(self.model_name) self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
@apply_skip_if_not_implemented
class Bnb4BitTest(Base4bitTest): class Bnb4BitTest(Base4bitTest):
def setUp(self): def setUp(self):
super().setUp() super().setUp()
@@ -206,7 +208,7 @@ class Bnb4BitTest(Base4bitTest):
tok = AutoTokenizer.from_pretrained(model_id) tok = AutoTokenizer.from_pretrained(model_id)
text = "Hello my name is" text = "Hello my name is"
input_ids = tok.encode(text, return_tensors="pt").to(0) input_ids = tok.encode(text, return_tensors="pt").to(torch_device)
_ = model.generate(input_ids, max_new_tokens=30) _ = model.generate(input_ids, max_new_tokens=30)
@@ -217,7 +219,9 @@ class Bnb4BitTest(Base4bitTest):
the same output across GPUs. So we'll generate few tokens (5-10) and check their output. the same output across GPUs. So we'll generate few tokens (5-10) and check their output.
""" """
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = self.model_4bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = self.model_4bit.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -234,7 +238,7 @@ class Bnb4BitTest(Base4bitTest):
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model_4bit_from_config.generate( output_sequences = model_4bit_from_config.generate(
input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10 input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
) )
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -252,7 +256,9 @@ class Bnb4BitTest(Base4bitTest):
model_4bit.dequantize() model_4bit.dequantize()
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model_4bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = model_4bit.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -267,15 +273,18 @@ class Bnb4BitTest(Base4bitTest):
self.assertEqual(self.model_4bit.device.type, "cpu") self.assertEqual(self.model_4bit.device.type, "cpu")
self.assertAlmostEqual(self.model_4bit.get_memory_footprint(), mem_before) self.assertAlmostEqual(self.model_4bit.get_memory_footprint(), mem_before)
# Move back to CUDA device if torch.cuda.is_available():
self.model_4bit.to(0) # Move back to CUDA device
self.assertEqual(self.model_4bit.device, torch.device(0)) self.model_4bit.to("cuda")
self.assertAlmostEqual(self.model_4bit.get_memory_footprint(), mem_before) self.assertEqual(self.model_4bit.device.type, "cuda")
self.assertAlmostEqual(self.model_4bit.get_memory_footprint(), mem_before)
def test_device_and_dtype_assignment(self): def test_device_and_dtype_assignment(self):
r""" r"""
Test whether trying to cast (or assigning a device to) a model after converting it in 4-bit will throw an error. Test whether attempting to change the device or cast the dtype of a model
Checks also if other models are casted correctly. after converting it to 4-bit precision will raise an appropriate error.
The test ensures that such operations are prohibited on 4-bit models
to prevent invalid conversions.
""" """
# Moving with `to` or `cuda` is not supported with versions < 0.43.2. # Moving with `to` or `cuda` is not supported with versions < 0.43.2.
@@ -297,25 +306,24 @@ class Bnb4BitTest(Base4bitTest):
self.model_4bit.to(torch.float16) self.model_4bit.to(torch.float16)
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
# Tries with a `dtype` and `device` # Tries to cast the 4-bit model to float32 using `float()`
self.model_4bit.to(device="cuda:0", dtype=torch.float16)
with self.assertRaises(ValueError):
# Tries with a cast
self.model_4bit.float() self.model_4bit.float()
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
# Tries with a cast # Tries to cast the 4-bit model to float16 using `half()`
self.model_4bit.half() self.model_4bit.half()
# Test if we did not break anything # Test if we did not break anything
self.model_4bit.to(torch.device(torch_device))
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
self.model_fp16 = self.model_fp16.to(torch.float32) self.model_fp16 = self.model_fp16.to(torch.float32)
_ = self.model_fp16.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) _ = self.model_fp16.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
# Check that this does not throw an error if torch.cuda.is_available():
_ = self.model_fp16.cuda() # Check that this does not throw an error
_ = self.model_fp16.cuda()
# Check this does not throw an error # Check this does not throw an error
_ = self.model_fp16.to("cpu") _ = self.model_fp16.to("cpu")
@@ -344,8 +352,9 @@ class Bnb4BitTest(Base4bitTest):
@require_bitsandbytes @require_bitsandbytes
@require_accelerate @require_accelerate
@require_torch @require_torch
@require_torch_gpu @require_torch_gpu_if_bnb_not_multi_backend_enabled
@slow @slow
@apply_skip_if_not_implemented
class Bnb4BitT5Test(unittest.TestCase): class Bnb4BitT5Test(unittest.TestCase):
@classmethod @classmethod
def setUpClass(cls): def setUpClass(cls):
@@ -375,14 +384,14 @@ class Bnb4BitT5Test(unittest.TestCase):
# test with `google-t5/t5-small` # test with `google-t5/t5-small`
model = T5ForConditionalGeneration.from_pretrained(self.model_name, load_in_4bit=True, device_map="auto") model = T5ForConditionalGeneration.from_pretrained(self.model_name, load_in_4bit=True, device_map="auto")
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
# test with `flan-t5-small` # test with `flan-t5-small`
model = T5ForConditionalGeneration.from_pretrained( model = T5ForConditionalGeneration.from_pretrained(
self.dense_act_model_name, load_in_4bit=True, device_map="auto" self.dense_act_model_name, load_in_4bit=True, device_map="auto"
) )
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
T5ForConditionalGeneration._keep_in_fp32_modules = modules T5ForConditionalGeneration._keep_in_fp32_modules = modules
@@ -400,17 +409,18 @@ class Bnb4BitT5Test(unittest.TestCase):
# there was a bug with decoders - this test checks that it is fixed # there was a bug with decoders - this test checks that it is fixed
self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear4bit)) self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear4bit))
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
# test with `flan-t5-small` # test with `flan-t5-small`
model = T5ForConditionalGeneration.from_pretrained( model = T5ForConditionalGeneration.from_pretrained(
self.dense_act_model_name, load_in_4bit=True, device_map="auto" self.dense_act_model_name, load_in_4bit=True, device_map="auto"
) )
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
@apply_skip_if_not_implemented
class Classes4BitModelTest(Base4bitTest): class Classes4BitModelTest(Base4bitTest):
def setUp(self): def setUp(self):
super().setUp() super().setUp()
@@ -460,6 +470,7 @@ class Classes4BitModelTest(Base4bitTest):
self.assertTrue(self.seq_to_seq_model.lm_head.weight.__class__ == torch.nn.Parameter) self.assertTrue(self.seq_to_seq_model.lm_head.weight.__class__ == torch.nn.Parameter)
@apply_skip_if_not_implemented
class Pipeline4BitTest(Base4bitTest): class Pipeline4BitTest(Base4bitTest):
def setUp(self): def setUp(self):
super().setUp() super().setUp()
@@ -469,7 +480,8 @@ class Pipeline4BitTest(Base4bitTest):
TearDown function needs to be called at the end of each test to free the GPU memory and cache, also to TearDown function needs to be called at the end of each test to free the GPU memory and cache, also to
avoid unexpected behaviors. Please see: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/27 avoid unexpected behaviors. Please see: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/27
""" """
del self.pipe if hasattr(self, "pipe"):
del self.pipe
gc.collect() gc.collect()
torch.cuda.empty_cache() torch.cuda.empty_cache()
@@ -484,7 +496,12 @@ class Pipeline4BitTest(Base4bitTest):
self.pipe = pipeline( self.pipe = pipeline(
"text-generation", "text-generation",
model=self.model_name, model=self.model_name,
model_kwargs={"device_map": "auto", "load_in_4bit": True, "torch_dtype": torch.float16}, model_kwargs={
"device_map": "auto",
"load_in_4bit": True,
# float16 isn't supported on CPU, use bfloat16 instead
"torch_dtype": torch.bfloat16 if torch_device == "cpu" else torch.float16,
},
max_new_tokens=self.MAX_NEW_TOKENS, max_new_tokens=self.MAX_NEW_TOKENS,
) )
@@ -494,6 +511,7 @@ class Pipeline4BitTest(Base4bitTest):
@require_torch_multi_gpu @require_torch_multi_gpu
@apply_skip_if_not_implemented
class Bnb4bitTestMultiGpu(Base4bitTest): class Bnb4bitTestMultiGpu(Base4bitTest):
def setUp(self): def setUp(self):
super().setUp() super().setUp()
@@ -515,10 +533,13 @@ class Bnb4bitTestMultiGpu(Base4bitTest):
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
# Second real batch # Second real batch
output_parallel = model_parallel.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_parallel = model_parallel.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@apply_skip_if_not_implemented
class Bnb4BitTestTraining(Base4bitTest): class Bnb4BitTestTraining(Base4bitTest):
def setUp(self): def setUp(self):
self.model_name = "facebook/opt-350m" self.model_name = "facebook/opt-350m"
@@ -531,7 +552,10 @@ class Bnb4BitTestTraining(Base4bitTest):
# Step 1: freeze all parameters # Step 1: freeze all parameters
model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_4bit=True) model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_4bit=True)
self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()}) if torch.cuda.is_available():
self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})
else:
self.assertTrue(all(param.device.type == "cpu" for param in model.parameters()))
for param in model.parameters(): for param in model.parameters():
param.requires_grad = False # freeze the model - train adapters later param.requires_grad = False # freeze the model - train adapters later
@@ -547,10 +571,10 @@ class Bnb4BitTestTraining(Base4bitTest):
module.v_proj = LoRALayer(module.v_proj, rank=16) module.v_proj = LoRALayer(module.v_proj, rank=16)
# Step 3: dummy batch # Step 3: dummy batch
batch = self.tokenizer("Test batch ", return_tensors="pt").to(0) batch = self.tokenizer("Test batch ", return_tensors="pt").to(torch_device)
# Step 4: Check if the gradient is not None # Step 4: Check if the gradient is not None
with torch.cuda.amp.autocast(): with torch.autocast(torch_device):
out = model.forward(**batch) out = model.forward(**batch)
out.logits.norm().backward() out.logits.norm().backward()
@@ -562,6 +586,7 @@ class Bnb4BitTestTraining(Base4bitTest):
self.assertTrue(module.weight.grad is None) self.assertTrue(module.weight.grad is None)
@apply_skip_if_not_implemented
class Bnb4BitGPT2Test(Bnb4BitTest): class Bnb4BitGPT2Test(Bnb4BitTest):
model_name = "openai-community/gpt2-xl" model_name = "openai-community/gpt2-xl"
EXPECTED_RELATIVE_DIFFERENCE = 3.3191854854152187 EXPECTED_RELATIVE_DIFFERENCE = 3.3191854854152187
@@ -570,8 +595,9 @@ class Bnb4BitGPT2Test(Bnb4BitTest):
@require_bitsandbytes @require_bitsandbytes
@require_accelerate @require_accelerate
@require_torch @require_torch
@require_torch_gpu @require_torch_gpu_if_bnb_not_multi_backend_enabled
@slow @slow
@apply_skip_if_not_implemented
class BaseSerializationTest(unittest.TestCase): class BaseSerializationTest(unittest.TestCase):
model_name = "facebook/opt-125m" model_name = "facebook/opt-125m"
input_text = "Mars colonists' favorite meals are" input_text = "Mars colonists' favorite meals are"
@@ -635,7 +661,9 @@ class BaseSerializationTest(unittest.TestCase):
d1[k].quant_state.as_dict().values(), d1[k].quant_state.as_dict().values(),
): ):
if isinstance(v0, torch.Tensor): if isinstance(v0, torch.Tensor):
self.assertTrue(torch.equal(v0, v1.to(v0.device))) # The absmax will not be saved in the quant_state when using NF4 in CPU
if v0.numel() != 0:
self.assertTrue(torch.equal(v0, v1.to(v0.device)))
else: else:
self.assertTrue(v0 == v1) self.assertTrue(v0 == v1)
@@ -659,6 +687,7 @@ class BaseSerializationTest(unittest.TestCase):
) )
@apply_skip_if_not_implemented
class ExtendedSerializationTest(BaseSerializationTest): class ExtendedSerializationTest(BaseSerializationTest):
""" """
tests more combinations of parameters tests more combinations of parameters
@@ -706,8 +735,9 @@ class GPTSerializationTest(BaseSerializationTest):
@require_bitsandbytes @require_bitsandbytes
@require_accelerate @require_accelerate
@require_torch_gpu @require_torch_gpu_if_bnb_not_multi_backend_enabled
@slow @slow
@apply_skip_if_not_implemented
class Bnb4BitTestBasicConfigTest(unittest.TestCase): class Bnb4BitTestBasicConfigTest(unittest.TestCase):
def test_load_in_4_and_8_bit_fails(self): def test_load_in_4_and_8_bit_fails(self):
with self.assertRaisesRegex(ValueError, "load_in_4bit and load_in_8bit are both True"): with self.assertRaisesRegex(ValueError, "load_in_4bit and load_in_8bit are both True"):

View File

@@ -30,14 +30,17 @@ from transformers import (
pipeline, pipeline,
) )
from transformers.testing_utils import ( from transformers.testing_utils import (
apply_skip_if_not_implemented,
is_accelerate_available, is_accelerate_available,
is_bitsandbytes_available,
is_torch_available, is_torch_available,
require_accelerate, require_accelerate,
require_bitsandbytes, require_bitsandbytes,
require_torch, require_torch,
require_torch_gpu, require_torch_gpu_if_bnb_not_multi_backend_enabled,
require_torch_multi_gpu, require_torch_multi_gpu,
slow, slow,
torch_device,
) )
@@ -77,10 +80,14 @@ if is_torch_available():
return self.module(input, *args, **kwargs) + self.adapter(input) return self.module(input, *args, **kwargs) + self.adapter(input)
if is_bitsandbytes_available():
import bitsandbytes as bnb
@require_bitsandbytes @require_bitsandbytes
@require_accelerate @require_accelerate
@require_torch @require_torch
@require_torch_gpu @require_torch_gpu_if_bnb_not_multi_backend_enabled
@slow @slow
class BaseMixedInt8Test(unittest.TestCase): class BaseMixedInt8Test(unittest.TestCase):
# We keep the constants inside the init function and model loading inside setUp function # We keep the constants inside the init function and model loading inside setUp function
@@ -108,6 +115,7 @@ class BaseMixedInt8Test(unittest.TestCase):
self.tokenizer = AutoTokenizer.from_pretrained(self.model_name) self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
@apply_skip_if_not_implemented
class MixedInt8Test(BaseMixedInt8Test): class MixedInt8Test(BaseMixedInt8Test):
def setUp(self): def setUp(self):
super().setUp() super().setUp()
@@ -240,7 +248,6 @@ class MixedInt8Test(BaseMixedInt8Test):
r""" r"""
A simple test to check if `llm_int8_skip_modules` works as expected A simple test to check if `llm_int8_skip_modules` works as expected
""" """
import bitsandbytes as bnb
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=["classifier"]) quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=["classifier"])
seq_classification_model = AutoModelForSequenceClassification.from_pretrained( seq_classification_model = AutoModelForSequenceClassification.from_pretrained(
@@ -263,7 +270,9 @@ class MixedInt8Test(BaseMixedInt8Test):
the same output across GPUs. So we'll generate few tokens (5-10) and check their output. the same output across GPUs. So we'll generate few tokens (5-10) and check their output.
""" """
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = self.model_8bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = self.model_8bit.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -280,7 +289,7 @@ class MixedInt8Test(BaseMixedInt8Test):
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model_8bit_from_config.generate( output_sequences = model_8bit_from_config.generate(
input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10 input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
) )
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -298,7 +307,9 @@ class MixedInt8Test(BaseMixedInt8Test):
model_8bit.dequantize() model_8bit.dequantize()
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model_8bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = model_8bit.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -319,8 +330,10 @@ class MixedInt8Test(BaseMixedInt8Test):
def test_device_and_dtype_assignment(self): def test_device_and_dtype_assignment(self):
r""" r"""
Test whether trying to cast (or assigning a device to) a model after converting it in 8-bit will throw an error. Test whether attempting to change the device or cast the dtype of a model
Checks also if other models are casted correctly. after converting it to 8-bit precision will raise an appropriate error.
The test ensures that such operations are prohibited on 8-bit models
to prevent invalid conversions.
""" """
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
# Tries with `str` # Tries with `str`
@@ -332,21 +345,21 @@ class MixedInt8Test(BaseMixedInt8Test):
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
# Tries with a `device` # Tries with a `device`
self.model_8bit.to(torch.device("cuda:0")) self.model_8bit.to(torch.device(torch_device))
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
# Tries with a `device` # Tries to cast the 8-bit model to float32 using `float()`
self.model_8bit.float() self.model_8bit.float()
with self.assertRaises(ValueError): with self.assertRaises(ValueError):
# Tries with a `device` # Tries to cast the 4-bit model to float16 using `half()`
self.model_8bit.half() self.model_8bit.half()
# Test if we did not break anything # Test if we did not break anything
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
self.model_fp16 = self.model_fp16.to(torch.float32) self.model_fp16 = self.model_fp16.to(torch.float32)
_ = self.model_fp16.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) _ = self.model_fp16.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
# Check this does not throw an error # Check this does not throw an error
_ = self.model_fp16.to("cpu") _ = self.model_fp16.to("cpu")
@@ -385,7 +398,9 @@ class MixedInt8Test(BaseMixedInt8Test):
# generate # generate
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model_from_saved.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = model_from_saved.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -410,7 +425,9 @@ class MixedInt8Test(BaseMixedInt8Test):
# generate # generate
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model_from_saved.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = model_from_saved.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -435,7 +452,9 @@ class MixedInt8Test(BaseMixedInt8Test):
# generate # generate
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model_from_saved.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = model_from_saved.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -455,7 +474,7 @@ class MixedInt8Test(BaseMixedInt8Test):
# generate # generate
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@@ -463,7 +482,7 @@ class MixedInt8Test(BaseMixedInt8Test):
@require_bitsandbytes @require_bitsandbytes
@require_accelerate @require_accelerate
@require_torch @require_torch
@require_torch_gpu @require_torch_gpu_if_bnb_not_multi_backend_enabled
@slow @slow
class MixedInt8T5Test(unittest.TestCase): class MixedInt8T5Test(unittest.TestCase):
@classmethod @classmethod
@@ -494,14 +513,14 @@ class MixedInt8T5Test(unittest.TestCase):
# test with `google-t5/t5-small` # test with `google-t5/t5-small`
model = T5ForConditionalGeneration.from_pretrained(self.model_name, load_in_8bit=True, device_map="auto") model = T5ForConditionalGeneration.from_pretrained(self.model_name, load_in_8bit=True, device_map="auto")
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
# test with `flan-t5-small` # test with `flan-t5-small`
model = T5ForConditionalGeneration.from_pretrained( model = T5ForConditionalGeneration.from_pretrained(
self.dense_act_model_name, load_in_8bit=True, device_map="auto" self.dense_act_model_name, load_in_8bit=True, device_map="auto"
) )
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
T5ForConditionalGeneration._keep_in_fp32_modules = modules T5ForConditionalGeneration._keep_in_fp32_modules = modules
@@ -511,7 +530,6 @@ class MixedInt8T5Test(unittest.TestCase):
`flan-t5-small` uses `T5DenseGatedActDense` whereas `google-t5/t5-small` uses `T5DenseReluDense`. We need to test `flan-t5-small` uses `T5DenseGatedActDense` whereas `google-t5/t5-small` uses `T5DenseReluDense`. We need to test
both cases. both cases.
""" """
import bitsandbytes as bnb
from transformers import T5ForConditionalGeneration from transformers import T5ForConditionalGeneration
@@ -521,14 +539,14 @@ class MixedInt8T5Test(unittest.TestCase):
# there was a bug with decoders - this test checks that it is fixed # there was a bug with decoders - this test checks that it is fixed
self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear8bitLt)) self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear8bitLt))
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
# test with `flan-t5-small` # test with `flan-t5-small`
model = T5ForConditionalGeneration.from_pretrained( model = T5ForConditionalGeneration.from_pretrained(
self.dense_act_model_name, load_in_8bit=True, device_map="auto" self.dense_act_model_name, load_in_8bit=True, device_map="auto"
) )
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
def test_inference_with_keep_in_fp32_serialized(self): def test_inference_with_keep_in_fp32_serialized(self):
@@ -538,7 +556,6 @@ class MixedInt8T5Test(unittest.TestCase):
`flan-t5-small` uses `T5DenseGatedActDense` whereas `google-t5/t5-small` uses `T5DenseReluDense`. We need to test `flan-t5-small` uses `T5DenseGatedActDense` whereas `google-t5/t5-small` uses `T5DenseReluDense`. We need to test
both cases. both cases.
""" """
import bitsandbytes as bnb
from transformers import T5ForConditionalGeneration from transformers import T5ForConditionalGeneration
@@ -553,14 +570,14 @@ class MixedInt8T5Test(unittest.TestCase):
# there was a bug with decoders - this test checks that it is fixed # there was a bug with decoders - this test checks that it is fixed
self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear8bitLt)) self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear8bitLt))
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
# test with `flan-t5-small` # test with `flan-t5-small`
model = T5ForConditionalGeneration.from_pretrained( model = T5ForConditionalGeneration.from_pretrained(
self.dense_act_model_name, load_in_8bit=True, device_map="auto" self.dense_act_model_name, load_in_8bit=True, device_map="auto"
) )
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0) encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
_ = model.generate(**encoded_input) _ = model.generate(**encoded_input)
@@ -614,6 +631,7 @@ class MixedInt8ModelClassesTest(BaseMixedInt8Test):
self.assertTrue(self.seq_to_seq_model.lm_head.weight.__class__ == torch.nn.Parameter) self.assertTrue(self.seq_to_seq_model.lm_head.weight.__class__ == torch.nn.Parameter)
@apply_skip_if_not_implemented
class MixedInt8TestPipeline(BaseMixedInt8Test): class MixedInt8TestPipeline(BaseMixedInt8Test):
def setUp(self): def setUp(self):
super().setUp() super().setUp()
@@ -623,7 +641,8 @@ class MixedInt8TestPipeline(BaseMixedInt8Test):
TearDown function needs to be called at the end of each test to free the GPU memory and cache, also to TearDown function needs to be called at the end of each test to free the GPU memory and cache, also to
avoid unexpected behaviors. Please see: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/27 avoid unexpected behaviors. Please see: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/27
""" """
del self.pipe if hasattr(self, "pipe"):
del self.pipe
gc.collect() gc.collect()
torch.cuda.empty_cache() torch.cuda.empty_cache()
@@ -648,6 +667,7 @@ class MixedInt8TestPipeline(BaseMixedInt8Test):
@require_torch_multi_gpu @require_torch_multi_gpu
@apply_skip_if_not_implemented
class MixedInt8TestMultiGpu(BaseMixedInt8Test): class MixedInt8TestMultiGpu(BaseMixedInt8Test):
def setUp(self): def setUp(self):
super().setUp() super().setUp()
@@ -669,11 +689,14 @@ class MixedInt8TestMultiGpu(BaseMixedInt8Test):
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
# Second real batch # Second real batch
output_parallel = model_parallel.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_parallel = model_parallel.generate(
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
)
self.assertIn(self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
@require_torch_multi_gpu @require_torch_multi_gpu
@apply_skip_if_not_implemented
class MixedInt8TestCpuGpu(BaseMixedInt8Test): class MixedInt8TestCpuGpu(BaseMixedInt8Test):
def setUp(self): def setUp(self):
super().setUp() super().setUp()
@@ -683,7 +706,7 @@ class MixedInt8TestCpuGpu(BaseMixedInt8Test):
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
# Check the exactness of the results # Check the exactness of the results
output_parallel = model.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_parallel = model.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
# Get the generation # Get the generation
output_text = self.tokenizer.decode(output_parallel[0], skip_special_tokens=True) output_text = self.tokenizer.decode(output_parallel[0], skip_special_tokens=True)
@@ -819,6 +842,7 @@ class MixedInt8TestCpuGpu(BaseMixedInt8Test):
self.check_inference_correctness(model_8bit) self.check_inference_correctness(model_8bit)
@apply_skip_if_not_implemented
class MixedInt8TestTraining(BaseMixedInt8Test): class MixedInt8TestTraining(BaseMixedInt8Test):
def setUp(self): def setUp(self):
self.model_name = "facebook/opt-350m" self.model_name = "facebook/opt-350m"
@@ -831,7 +855,10 @@ class MixedInt8TestTraining(BaseMixedInt8Test):
# Step 1: freeze all parameters # Step 1: freeze all parameters
model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_8bit=True) model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_8bit=True)
self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()}) if torch.cuda.is_available():
self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})
else:
self.assertTrue(all(param.device.type == "cpu" for param in model.parameters()))
for param in model.parameters(): for param in model.parameters():
param.requires_grad = False # freeze the model - train adapters later param.requires_grad = False # freeze the model - train adapters later
@@ -847,10 +874,10 @@ class MixedInt8TestTraining(BaseMixedInt8Test):
module.v_proj = LoRALayer(module.v_proj, rank=16) module.v_proj = LoRALayer(module.v_proj, rank=16)
# Step 3: dummy batch # Step 3: dummy batch
batch = self.tokenizer("Test batch ", return_tensors="pt").to(0) batch = self.tokenizer("Test batch ", return_tensors="pt").to(torch_device)
# Step 4: Check if the gradient is not None # Step 4: Check if the gradient is not None
with torch.cuda.amp.autocast(): with torch.autocast(torch_device):
out = model.forward(**batch) out = model.forward(**batch)
out.logits.norm().backward() out.logits.norm().backward()
@@ -862,6 +889,7 @@ class MixedInt8TestTraining(BaseMixedInt8Test):
self.assertTrue(module.weight.grad is None) self.assertTrue(module.weight.grad is None)
@apply_skip_if_not_implemented
class MixedInt8GPT2Test(MixedInt8Test): class MixedInt8GPT2Test(MixedInt8Test):
model_name = "openai-community/gpt2-xl" model_name = "openai-community/gpt2-xl"
EXPECTED_RELATIVE_DIFFERENCE = 1.8720077507258357 EXPECTED_RELATIVE_DIFFERENCE = 1.8720077507258357
@@ -870,6 +898,9 @@ class MixedInt8GPT2Test(MixedInt8Test):
EXPECTED_OUTPUTS.add("Hello my name is John Doe, and I'm a fan of the") EXPECTED_OUTPUTS.add("Hello my name is John Doe, and I'm a fan of the")
# Expected values on a A10 # Expected values on a A10
EXPECTED_OUTPUTS.add("Hello my name is John Doe, and I am a member of the") EXPECTED_OUTPUTS.add("Hello my name is John Doe, and I am a member of the")
# Expected values on Intel CPU
EXPECTED_OUTPUTS.add("Hello my name is John Doe. I am a man. I am")
EXPECTED_OUTPUTS.add("Hello my name is John, and I'm a writer. I'm")
def test_int8_from_pretrained(self): def test_int8_from_pretrained(self):
r""" r"""
@@ -887,6 +918,6 @@ class MixedInt8GPT2Test(MixedInt8Test):
# generate # generate
encoded_input = self.tokenizer(self.input_text, return_tensors="pt") encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10) output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS) self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)