Enable BNB multi-backend support (#31098)
* enable cpu bnb path
* fix style
* fix code style
* fix 4 bit path
* Update src/transformers/utils/import_utils.py
  Co-authored-by: Aarni Koskela <akx@iki.fi>
* add multi backend refactor tests
* fix style
* tweak 4bit quantizer + fix corresponding tests
* tweak 8bit quantizer + *try* fixing corresponding tests
* fix dequant bnb 8bit
* account for Intel CPU in variability of expected outputs
* enable cpu and xpu device map
* further tweaks to account for Intel CPU
* fix autocast to work with both cpu + cuda
* fix comments
* fix comments
* switch to testing_utils.torch_device
* allow for xpu in multi-gpu tests
* fix tests 4bit for CPU NF4
* fix bug with is_torch_xpu_available needing to be called as func
* avoid issue where test reports attr err due to other failure
* fix formatting
* fix typo from resolving of merge conflict
* polish based on last PR review
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* fix CI
* Update src/transformers/integrations/integration_utils.py
  Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/integrations/integration_utils.py
  Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix error log
* fix error msg
* add \n in error log
* make quality
* rm bnb cuda restriction in doc
* cpu model don't need dispatch
* fix doc
* fix style
* check cuda available in testing
* fix tests
* Update docs/source/en/model_doc/chameleon.md
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update docs/source/en/model_doc/llava_next.md
  Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update tests/quantization/bnb/test_4bit.py
  Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update tests/quantization/bnb/test_4bit.py
  Co-authored-by: Aarni Koskela <akx@iki.fi>
* fix doc
* fix check multibackends
* fix import sort
* remove check torch in bnb
* docs: update bitsandbytes references with multi-backend info
* docs: fix small mistakes in bnb paragraph
* run formatting
* revert bnb check
* move bnb multi-backend check to import_utils
* Update src/transformers/utils/import_utils.py
  Co-authored-by: Aarni Koskela <akx@iki.fi>
* fix bnb check
* minor fix for bnb
* check lib first
* fix code style
* Revert "run formatting"
  This reverts commit ac108c6d6b34f45a5745a736ba57282405cfaa61.
* fix format
* give warning when bnb version is low and no cuda found
* fix device assignment check to be multi-device capable
* address akx feedback on get_avlbl_dev fn
* revert partially, as we don't want the function that public, as docs would be too much (enforced)

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@@ -181,7 +181,7 @@ for every matrix multiplication. Dequantization and re-quantization is performed
|
||||
|
||||
Therefore, inference time is often **not** reduced when using quantized weights, but rather increases.
|
||||
Enough theory, let's give it a try! To quantize the weights with Transformers, you need to make sure that
|
||||
the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) library is installed.
|
||||
the [`bitsandbytes`](https://github.com/bitsandbytes-foundation/bitsandbytes) library is installed.
|
||||
|
||||
```bash
|
||||
!pip install bitsandbytes
|
||||
|
||||
@@ -128,7 +128,17 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
|
||||
|
||||
### Quantization using Bitsandbytes
|
||||
|
||||
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with:
|
||||
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library.
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Simply change the snippet above with:
|
||||
|
||||
```python
|
||||
from transformers import ChameleonForConditionalGeneration, BitsAndBytesConfig
|
||||
|
||||
@@ -233,7 +233,17 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
|
||||
|
||||
### Quantization using Bitsandbytes
|
||||
|
||||
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with:
|
||||
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes`, and to have access to a GPU/accelerator that is supported by the library.
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Simply change the snippet above with:
|
||||
|
||||
```python
|
||||
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig
|
||||
|
||||
@@ -205,7 +205,17 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
|
||||
|
||||
The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. This allows for efficient deployment on resource-constrained cases.
|
||||
|
||||
First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a CUDA compatible GPU device. Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:
|
||||
First, make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library.
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Then simply load the quantized model by adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:
|
||||
|
||||
|
||||
```python
|
||||
|
||||
@@ -264,9 +264,19 @@ processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spac
|
||||
|
||||
## Model optimization
|
||||
|
||||
### Quantization using Bitsandbytes
|
||||
### Quantization using bitsandbytes
|
||||
|
||||
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with:
|
||||
The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a GPU/accelerator that is supported by the library.
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Simply change the snippet above with:
|
||||
|
||||
```python
|
||||
from transformers import LlavaOnevisionForConditionalGeneration, BitsAndBytesConfig
|
||||
|
||||
@@ -141,7 +141,7 @@ The Flash Attention-2 model uses also a more memory efficient cache slicing mech
|
||||
|
||||
As the Mixtral model has 45 billion parameters, that would require about 90GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), a single A100 with 40GB of RAM is enough to fit the entire model, as in that case only about 27 GB of RAM is required.
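
As a rough check of the arithmetic above, the sketch below computes the weight-only footprint at two precisions. The helper name `weight_memory_gb` is made up for illustration, and 1 GB is assumed to mean 10^9 bytes. It counts weights only, so the 4-bit result comes out below the roughly 27 GB quoted above, which presumably includes overhead that this sketch does not model.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes); ignores activations and any overhead."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


mixtral_params = 45e9  # parameter count quoted in the text above

print(weight_memory_gb(mixtral_params, 16))  # float16: 2 bytes per parameter
print(weight_memory_gb(mixtral_params, 4))  # 4-bit: half a byte per parameter
```

The same helper can be reused for other parameter counts or bit widths when deciding whether a model is likely to fit on a given device.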
|
||||
|
||||
Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the BitsAndyBytes quantization (but refer to [this page](../quantization.md) for other quantization methods):
|
||||
Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization.md) for alternative quantization methods):
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
|
||||
@@ -139,7 +139,17 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
|
||||
|
||||
The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. This allows for efficient deployment on resource-constrained cases.
|
||||
|
||||
First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a CUDA compatible GPU device. Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:
|
||||
First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library.
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:
|
||||
|
||||
|
||||
```python
|
||||
|
||||
@@ -233,7 +233,7 @@ Let's look at the details.
|
||||
**Optimizer States:**
|
||||
|
||||
- 8 bytes * number of parameters for normal AdamW (maintains 2 states)
|
||||
- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
|
||||
- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes)
|
||||
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)
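
To make the per-parameter figures in the list above concrete, here is a small sketch that turns them into a memory estimate. The dictionary keys and the function name `optimizer_state_gb` are illustrative names, not part of any library, and 1 GB is assumed to mean 10^9 bytes.

```python
# Bytes of optimizer state per parameter, taken from the list above.
OPTIMIZER_STATE_BYTES = {
    "adamw": 8,  # normal AdamW, maintains 2 states
    "adamw_8bit": 2,  # 8-bit AdamW optimizers such as bitsandbytes
    "sgd_momentum": 4,  # SGD with momentum, maintains 1 state
}


def optimizer_state_gb(num_params: int, optimizer: str) -> float:
    """Estimated optimizer-state memory in GB (1 GB = 1e9 bytes)."""
    return num_params * OPTIMIZER_STATE_BYTES[optimizer] / 1e9


# Example: a hypothetical model with 1 billion parameters.
for name in OPTIMIZER_STATE_BYTES:
    print(name, optimizer_state_gb(1_000_000_000, name))
```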
|
||||
|
||||
**Gradients**
|
||||
|
||||
@@ -284,7 +284,7 @@ training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bn
|
||||
|
||||
However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated.
|
||||
|
||||
First, follow the installation guide in the GitHub [repo](https://github.com/TimDettmers/bitsandbytes) to install the `bitsandbytes` library
|
||||
First, follow the installation guide in the GitHub [repo](https://github.com/bitsandbytes-foundation/bitsandbytes) to install the `bitsandbytes` library
|
||||
that implements the 8-bit Adam optimizer.
|
||||
|
||||
Next you need to initialize the optimizer. This involves two steps:
|
||||
|
||||
@@ -38,6 +38,14 @@ pip install --upgrade accelerate transformers
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Now you can quantize a model by passing a `BitsAndBytesConfig` to the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers.
|
||||
|
||||
<hfoptions id="bnb">
|
||||
|
||||
@@ -49,7 +49,7 @@ Use the table below to help you decide which quantization method to use.
|
||||
|-------------------------------------|-------------------------|-----|----------|----------------|-----------------------|-------------------------|----------------|-------------------------------------|--------------|------------------------|---------------------------------------------|
|
||||
| [AQLM](./aqlm) | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 1 / 2 | 🟢 | 🟢 | 🟢 | https://github.com/Vahe1994/AQLM |
|
||||
| [AWQ](./awq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | ? | 4 | 🟢 | 🟢 | 🟢 | https://github.com/casper-hansen/AutoAWQ |
|
||||
| [bitsandbytes](./bitsandbytes) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 4 / 8 | 🟢 | 🟢 | 🟢 | https://github.com/TimDettmers/bitsandbytes |
|
||||
| [bitsandbytes](./bitsandbytes) | 🟢 | 🟡 * | 🟢 | 🟡 * | 🔴 ** | 🔴 (soon!) | 4 / 8 | 🟢 | 🟢 | 🟢 | https://github.com/bitsandbytes-foundation/bitsandbytes |
|
||||
| [EETQ](./eetq) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | ? | 8 | 🟢 | 🟢 | 🟢 | https://github.com/NetEase-FuXi/EETQ |
|
||||
| GGUF / GGML (llama.cpp) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 1 - 8 | 🔴 | [See GGUF section](../gguf) | [See GGUF section](../gguf) | https://github.com/ggerganov/llama.cpp |
|
||||
| [GPTQ](./gptq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 2 - 3 - 4 - 8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ |
|
||||
@@ -57,3 +57,17 @@ Use the table below to help you decide which quantization method to use.
|
||||
| [Quanto](./quanto) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 2 / 4 / 8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/quanto |
|
||||
| [FBGEMM_FP8](./fbgemm_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
|
||||
| [torchao](./torchao.md) | 🟢 | | 🟢 | 🔴 | partial support (int4 weight only) | | 4 / 8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
|
||||
|
||||
<Tip>
|
||||
|
||||
\* bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
<Tip>
|
||||
|
||||
\** bitsandbytes is seeking contributors to help develop and lead the Apple Silicon backend. Interested? Contact them directly via their repo. Stipends may be available through sponsorships.
|
||||
|
||||
</Tip>
|
||||
|
||||
@@ -31,6 +31,7 @@ _import_structure = {
|
||||
"replace_with_bnb_linear",
|
||||
"set_module_8bit_tensor_to_device",
|
||||
"set_module_quantized_tensor_to_device",
|
||||
"validate_bnb_backend_availability",
|
||||
],
|
||||
"deepspeed": [
|
||||
"HfDeepSpeedConfig",
|
||||
@@ -124,6 +125,7 @@ if TYPE_CHECKING:
|
||||
replace_with_bnb_linear,
|
||||
set_module_8bit_tensor_to_device,
|
||||
set_module_quantized_tensor_to_device,
|
||||
validate_bnb_backend_availability,
|
||||
)
|
||||
from .deepspeed import (
|
||||
HfDeepSpeedConfig,
|
||||
|
||||
@@ -6,7 +6,15 @@ from inspect import signature
|
||||
|
||||
from packaging import version
|
||||
|
||||
from ..utils import is_accelerate_available, is_bitsandbytes_available, logging
|
||||
from ..utils import (
|
||||
get_available_devices,
|
||||
is_accelerate_available,
|
||||
is_bitsandbytes_available,
|
||||
is_bitsandbytes_multi_backend_available,
|
||||
is_ipex_available,
|
||||
is_torch_available,
|
||||
logging,
|
||||
)
|
||||
|
||||
|
||||
if is_bitsandbytes_available():
|
||||
@@ -332,7 +340,7 @@ def get_keys_to_not_convert(model):
|
||||
|
||||
|
||||
# Copied from PEFT: https://github.com/huggingface/peft/blob/47b3712898539569c02ec5b3ed4a6c36811331a1/src/peft/utils/integrations.py#L41
|
||||
def dequantize_bnb_weight(weight: "torch.nn.Parameter", state=None):
|
||||
def dequantize_bnb_weight(weight: "torch.nn.Parameter", dtype: "torch.dtype", state=None):
|
||||
"""
|
||||
Helper function to dequantize 4bit or 8bit bnb weights.
|
||||
|
||||
@@ -350,7 +358,7 @@ def dequantize_bnb_weight(weight: "torch.nn.Parameter", state=None):
|
||||
logger.warning_once(
|
||||
f"The model is going to be dequantized in {output_tensor.dtype} - if you want to upcast it to another dtype, make sure to pass the desired dtype when quantizing the model through `bnb_4bit_quant_type` argument of `BitsAndBytesConfig`"
|
||||
)
|
||||
return output_tensor
|
||||
return output_tensor.to(dtype)
|
||||
|
||||
if state.SCB is None:
|
||||
state.SCB = weight.SCB
|
||||
@@ -361,7 +369,7 @@ def dequantize_bnb_weight(weight: "torch.nn.Parameter", state=None):
|
||||
if state.CxB is None:
|
||||
state.CxB, state.SB = bnb.functional.transform(weight.data, to_order=state.formatB)
|
||||
out32, Sout32 = bnb.functional.igemmlt(im, state.CxB, Sim, state.SB)
|
||||
return bnb.functional.mm_dequant(out32, Sout32, SCim, state.SCB, bias=None).t()
|
||||
return bnb.functional.mm_dequant(out32, Sout32, SCim, state.SCB, bias=None).t().to(dtype)
|
||||
|
||||
|
||||
def _create_accelerate_new_hook(old_hook):
|
||||
@@ -383,6 +391,7 @@ def _create_accelerate_new_hook(old_hook):
|
||||
|
||||
def _dequantize_and_replace(
|
||||
model,
|
||||
dtype,
|
||||
modules_to_not_convert=None,
|
||||
current_key_name=None,
|
||||
quantization_config=None,
|
||||
@@ -422,7 +431,7 @@ def _dequantize_and_replace(
|
||||
else:
|
||||
state = None
|
||||
|
||||
new_module.weight = torch.nn.Parameter(dequantize_bnb_weight(module.weight, state))
|
||||
new_module.weight = torch.nn.Parameter(dequantize_bnb_weight(module.weight, dtype, state))
|
||||
|
||||
if bias is not None:
|
||||
new_module.bias = bias
|
||||
@@ -441,6 +450,7 @@ def _dequantize_and_replace(
|
||||
if len(list(module.children())) > 0:
|
||||
_, has_been_replaced = _dequantize_and_replace(
|
||||
module,
|
||||
dtype,
|
||||
modules_to_not_convert,
|
||||
current_key_name,
|
||||
quantization_config,
|
||||
@@ -458,6 +468,7 @@ def dequantize_and_replace(
|
||||
):
|
||||
model, has_been_replaced = _dequantize_and_replace(
|
||||
model,
|
||||
model.dtype,
|
||||
modules_to_not_convert=modules_to_not_convert,
|
||||
quantization_config=quantization_config,
|
||||
)
|
||||
@@ -468,3 +479,80 @@ def dequantize_and_replace(
|
||||
)
|
||||
|
||||
return model
|
||||
|
||||
|
||||
def _validate_bnb_multi_backend_availability(raise_exception):
    import bitsandbytes as bnb

    bnb_supported_devices = getattr(bnb, "supported_torch_devices", set())
    available_devices = get_available_devices()

    if available_devices == {"cpu"} and not is_ipex_available():
        from importlib.util import find_spec

        if find_spec("intel_extension_for_pytorch"):
            logger.warning(
                "You have Intel IPEX installed but if you're intending to use it for CPU, it might not have the right version. Be sure to double check that your PyTorch and IPEX installs are compatible."
            )

        available_devices.discard("cpu")  # Only Intel CPU is supported by BNB at the moment

    if not available_devices.intersection(bnb_supported_devices):
        if raise_exception:
            bnb_supported_devices_with_info = set(  # noqa: C401
                '"cpu" (needs an Intel CPU and intel_extension_for_pytorch installed and compatible with the PyTorch version)'
                if device == "cpu"
                else device
                for device in bnb_supported_devices
            )
            err_msg = (
                f"None of the available devices `available_devices = {available_devices or None}` are supported by the bitsandbytes version you have installed: `bnb_supported_devices = {bnb_supported_devices_with_info}`. "
                "Please check the docs to see if the backend you intend to use is available and how to install it: https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend"
            )

            logger.error(err_msg)
            raise RuntimeError(err_msg)

        logger.warning("No supported devices found for bitsandbytes multi-backend.")
        return False

    logger.debug("Multi-backend validation successful.")
    return True


def _validate_bnb_cuda_backend_availability(raise_exception):
    if not is_torch_available():
        return False

    import torch

    if not torch.cuda.is_available():
        log_msg = (
            "CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. "
            "Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend"
        )
        if raise_exception:
            logger.error(log_msg)
            raise RuntimeError(log_msg)

        logger.warning(log_msg)
        return False

    logger.debug("CUDA backend validation successful.")
    return True


def validate_bnb_backend_availability(raise_exception=False):
    """
    Validates if the available devices are supported by bitsandbytes, optionally raising an exception if not.
    """
    if not is_bitsandbytes_available():
        if importlib.util.find_spec("bitsandbytes") and version.parse(
            importlib.metadata.version("bitsandbytes")
        ) < version.parse("0.43.1"):
            return _validate_bnb_cuda_backend_availability(raise_exception)
        return False

    if is_bitsandbytes_multi_backend_available():
        return _validate_bnb_multi_backend_availability(raise_exception)
    return _validate_bnb_cuda_backend_availability(raise_exception)
|
||||
|
||||
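The multi-backend check in the hunk above boils down to a set intersection between the devices found on the machine and the devices the installed bitsandbytes build reports. The following is a minimal, dependency-free sketch of that idea and not the library's implementation: the function name `has_supported_device` and its parameters are hypothetical, and it assumes the `discard("cpu")` call belongs to the CPU-only branch, which the flattened diff does not show unambiguously.

```python
def has_supported_device(available: set, supported: set, ipex_installed: bool) -> bool:
    """Return True if at least one usable device is also supported by the backend."""
    usable = set(available)  # copy so the caller's set is not mutated

    # Assumption mirrored from the diff: on a CPU-only machine, the CPU only
    # counts as usable when a working IPEX install is present.
    if usable == {"cpu"} and not ipex_installed:
        usable.discard("cpu")

    return bool(usable & supported)


# CPU-only machine without IPEX: nothing usable remains.
print(has_supported_device({"cpu"}, {"cpu", "cuda"}, ipex_installed=False))
# CPU plus CUDA, with CUDA supported by the backend.
print(has_supported_device({"cpu", "cuda"}, {"cuda"}, ipex_installed=False))
```

Keeping the decision in a pure function like this makes it easy to test without any accelerator present; the real code additionally logs and optionally raises a `RuntimeError`.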
@@ -29,6 +29,7 @@ from ..utils import (
|
||||
is_accelerate_available,
|
||||
is_bitsandbytes_available,
|
||||
is_torch_available,
|
||||
is_torch_xpu_available,
|
||||
logging,
|
||||
)
|
||||
|
||||
@@ -65,8 +66,6 @@ class Bnb4BitHfQuantizer(HfQuantizer):
|
||||
self.modules_to_not_convert = self.quantization_config.llm_int8_skip_modules
|
||||
|
||||
def validate_environment(self, *args, **kwargs):
|
||||
if not torch.cuda.is_available():
|
||||
raise RuntimeError("No GPU found. A GPU is needed for quantization.")
|
||||
if not is_accelerate_available():
|
||||
raise ImportError(
|
||||
f"Using `bitsandbytes` 4-bit quantization requires Accelerate: `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`"
|
||||
@@ -76,6 +75,12 @@ class Bnb4BitHfQuantizer(HfQuantizer):
|
||||
"Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`"
|
||||
)
|
||||
|
||||
from ..integrations import validate_bnb_backend_availability
|
||||
from ..utils import is_bitsandbytes_multi_backend_available
|
||||
|
||||
bnb_multibackend_is_enabled = is_bitsandbytes_multi_backend_available()
|
||||
validate_bnb_backend_availability(raise_exception=True)
|
||||
|
||||
if kwargs.get("from_tf", False) or kwargs.get("from_flax", False):
|
||||
raise ValueError(
|
||||
"Converting into 4-bit or 8-bit weights from tf/flax weights is currently not supported, please make"
|
||||
@@ -91,7 +96,9 @@ class Bnb4BitHfQuantizer(HfQuantizer):
|
||||
device_map_without_lm_head = {
|
||||
key: device_map[key] for key in device_map.keys() if key not in self.modules_to_not_convert
|
||||
}
|
||||
if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
|
||||
if set(device_map.values()) == {"cpu"} and bnb_multibackend_is_enabled:
|
||||
pass
|
||||
elif "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
|
||||
raise ValueError(
|
||||
"Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the "
|
||||
"quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules "
|
||||
@@ -255,10 +262,15 @@ class Bnb4BitHfQuantizer(HfQuantizer):
|
||||
# Copied from transformers.quantizers.quantizer_bnb_8bit.Bnb8BitHfQuantizer.update_device_map
|
||||
def update_device_map(self, device_map):
|
||||
if device_map is None:
|
||||
device_map = {"": torch.cuda.current_device()}
|
||||
if torch.cuda.is_available():
|
||||
device_map = {"": torch.cuda.current_device()}
|
||||
elif is_torch_xpu_available():
|
||||
device_map = {"": f"xpu:{torch.xpu.current_device()}"}
|
||||
else:
|
||||
device_map = {"": "cpu"}
|
||||
logger.info(
|
||||
"The device_map was not initialized. "
|
||||
"Setting device_map to {'':torch.cuda.current_device()}. "
|
||||
f"Setting device_map to {device_map}. "
|
||||
"If you want to use the model for inference, please set device_map ='auto' "
|
||||
)
|
||||
return device_map
|
||||
|
||||
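The new `update_device_map` fallback above picks CUDA first, then XPU, then CPU. The sketch below restates that order as a pure function so it can be run without PyTorch. The name `default_device_map` and its boolean and index parameters are stand-ins for the `torch.cuda` and `torch.xpu` queries in the diff, not part of Transformers.

```python
def default_device_map(
    cuda_available: bool,
    xpu_available: bool,
    cuda_index: int = 0,
    xpu_index: int = 0,
) -> dict:
    """Pick a single-device map in the same priority order as the diff: CUDA, XPU, CPU."""
    if cuda_available:
        return {"": cuda_index}  # CUDA devices are addressed by integer index
    if xpu_available:
        return {"": f"xpu:{xpu_index}"}  # XPU devices use an "xpu:<index>" string
    return {"": "cpu"}


print(default_device_map(cuda_available=False, xpu_available=True))
print(default_device_map(cuda_available=False, xpu_available=False))
```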
@@ -27,6 +27,7 @@ from ..utils import (
|
||||
is_accelerate_available,
|
||||
is_bitsandbytes_available,
|
||||
is_torch_available,
|
||||
is_torch_xpu_available,
|
||||
logging,
|
||||
)
|
||||
from .quantizers_utils import get_module_from_name
|
||||
@@ -64,9 +65,6 @@ class Bnb8BitHfQuantizer(HfQuantizer):
|
||||
self.modules_to_not_convert = self.quantization_config.llm_int8_skip_modules
|
||||
|
||||
def validate_environment(self, *args, **kwargs):
|
||||
if not torch.cuda.is_available():
|
||||
raise RuntimeError("No GPU found. A GPU is needed for quantization.")
|
||||
|
||||
if not is_accelerate_available():
|
||||
raise ImportError(
|
||||
f"Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`"
|
||||
@@ -76,6 +74,12 @@ class Bnb8BitHfQuantizer(HfQuantizer):
|
||||
"Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`"
|
||||
)
|
||||
|
||||
from ..integrations import validate_bnb_backend_availability
|
||||
from ..utils import is_bitsandbytes_multi_backend_available
|
||||
|
||||
bnb_multibackend_is_enabled = is_bitsandbytes_multi_backend_available()
|
||||
validate_bnb_backend_availability(raise_exception=True)
|
||||
|
||||
if kwargs.get("from_tf", False) or kwargs.get("from_flax", False):
|
||||
raise ValueError(
|
||||
"Converting into 4-bit or 8-bit weights from tf/flax weights is currently not supported, please make"
|
||||
@@ -91,7 +95,9 @@ class Bnb8BitHfQuantizer(HfQuantizer):
|
||||
device_map_without_lm_head = {
|
||||
key: device_map[key] for key in device_map.keys() if key not in self.modules_to_not_convert
|
||||
}
|
||||
if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
|
||||
if set(device_map.values()) == {"cpu"} and bnb_multibackend_is_enabled:
|
||||
pass
|
||||
elif "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values():
|
||||
raise ValueError(
|
||||
"Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the "
|
||||
"quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules "
|
||||
@@ -127,10 +133,15 @@ class Bnb8BitHfQuantizer(HfQuantizer):
|
||||
|
||||
def update_device_map(self, device_map):
|
||||
if device_map is None:
|
||||
device_map = {"": torch.cuda.current_device()}
|
||||
if torch.cuda.is_available():
|
||||
device_map = {"": torch.cuda.current_device()}
|
||||
elif is_torch_xpu_available():
|
||||
device_map = {"": f"xpu:{torch.xpu.current_device()}"}
|
||||
else:
|
||||
device_map = {"": "cpu"}
|
||||
logger.info(
|
||||
"The device_map was not initialized. "
|
||||
"Setting device_map to {'':torch.cuda.current_device()}. "
|
||||
f"Setting device_map to {device_map}. "
|
||||
"If you want to use the model for inference, please set device_map ='auto' "
|
||||
)
|
||||
return device_map
|
||||
|
||||
@@ -61,6 +61,7 @@ from .utils import (
|
||||
is_auto_gptq_available,
|
||||
is_av_available,
|
||||
is_bitsandbytes_available,
|
||||
is_bitsandbytes_multi_backend_available,
|
||||
is_bs4_available,
|
||||
is_cv2_available,
|
||||
is_cython_available,
|
||||
@@ -224,6 +225,17 @@ _run_agent_tests = parse_flag_from_env("RUN_AGENT_TESTS", default=False)
|
||||
_run_third_party_device_tests = parse_flag_from_env("RUN_THIRD_PARTY_DEVICE_TESTS", default=False)
|
||||
|
||||
|
||||
def get_device_count():
    import torch

    if is_torch_xpu_available():
        num_devices = torch.xpu.device_count()
    else:
        num_devices = torch.cuda.device_count()

    return num_devices
|
||||
|
||||
|
||||
def is_pt_tf_cross_test(test_case):
|
||||
"""
|
||||
Decorator marking a test as a test that control interactions between PyTorch and TensorFlow.
|
||||
@@ -331,6 +343,29 @@ def tooslow(test_case):
|
||||
return unittest.skip(reason="test is too slow")(test_case)
|
||||
|
||||
|
||||
def skip_if_not_implemented(test_func):
    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        try:
            return test_func(*args, **kwargs)
        except NotImplementedError as e:
            raise unittest.SkipTest(f"Test skipped due to NotImplementedError: {e}")

    return wrapper


def apply_skip_if_not_implemented(cls):
    """
    Class decorator to apply @skip_if_not_implemented to all test methods.
    """
    for attr_name in dir(cls):
        if attr_name.startswith("test_"):
            attr = getattr(cls, attr_name)
            if callable(attr):
                setattr(cls, attr_name, skip_if_not_implemented(attr))
    return cls
|
||||
|
||||
|
||||
def custom_tokenizers(test_case):
|
||||
"""
|
||||
Decorator marking a test for a custom tokenizer.
|
||||
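To see what the two decorators added above do in practice, the following standalone sketch copies them verbatim and applies them to a toy test class. `DemoBackendTest` and its two methods are hypothetical and exist only for this illustration; the point is that a `NotImplementedError` raised by a backend is reported as a skip rather than an error.

```python
import functools
import io
import unittest


def skip_if_not_implemented(test_func):
    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        try:
            return test_func(*args, **kwargs)
        except NotImplementedError as e:
            # Convert "backend lacks this op" into a skip instead of an error.
            raise unittest.SkipTest(f"Test skipped due to NotImplementedError: {e}")

    return wrapper


def apply_skip_if_not_implemented(cls):
    # Wrap every callable attribute whose name starts with "test_".
    for attr_name in dir(cls):
        if attr_name.startswith("test_"):
            attr = getattr(cls, attr_name)
            if callable(attr):
                setattr(cls, attr_name, skip_if_not_implemented(attr))
    return cls


@apply_skip_if_not_implemented
class DemoBackendTest(unittest.TestCase):  # hypothetical test class
    def test_supported_op(self):
        self.assertEqual(1 + 1, 2)

    def test_unsupported_op(self):
        raise NotImplementedError("op not available on this backend")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoBackendTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(f"run={result.testsRun} skipped={len(result.skipped)} errors={len(result.errors)}")
```

Because `dir(cls)` also lists inherited attributes, decorating a subclass wraps the test methods it inherits from its base class as well.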
@@ -738,9 +773,9 @@ def require_torch_multi_gpu(test_case):
     if not is_torch_available():
         return unittest.skip(reason="test requires PyTorch")(test_case)
 
-    import torch
+    device_count = get_device_count()
 
-    return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple GPUs")(test_case)
+    return unittest.skipUnless(device_count > 1, "test requires multiple GPUs")(test_case)
 
 
 def require_torch_multi_accelerator(test_case):
@@ -947,6 +982,15 @@ def require_torch_gpu(test_case):
     return unittest.skipUnless(torch_device == "cuda", "test requires CUDA")(test_case)
 
 
+def require_torch_gpu_if_bnb_not_multi_backend_enabled(test_case):
+    """
+    Decorator marking a test that requires a GPU if bitsandbytes multi-backend feature is not enabled.
+    """
+    if is_bitsandbytes_available() and is_bitsandbytes_multi_backend_available():
+        return test_case
+    return require_torch_gpu(test_case)
+
+
 def require_torch_accelerator(test_case):
     """Decorator marking a test that requires an accessible accelerator and PyTorch."""
     return unittest.skipUnless(torch_device is not None and torch_device != "cpu", "test requires accelerator")(
@@ -15,6 +15,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+from functools import lru_cache
+from typing import FrozenSet
+
 from huggingface_hub import get_full_repo_name  # for backward compatibility
 from huggingface_hub.constants import HF_HUB_DISABLE_TELEMETRY as DISABLE_TELEMETRY  # for backward compatibility
 from packaging import version
@@ -118,6 +121,7 @@ from .import_utils import (
     is_auto_gptq_available,
     is_av_available,
     is_bitsandbytes_available,
+    is_bitsandbytes_multi_backend_available,
     is_bs4_available,
     is_coloredlogs_available,
     is_cv2_available,
@@ -277,3 +281,31 @@ def check_min_version(min_version):
             + "Check out https://github.com/huggingface/transformers/tree/main/examples#important-note for the examples corresponding to other "
             "versions of HuggingFace Transformers."
         )
+
+
+@lru_cache()
+def get_available_devices() -> FrozenSet[str]:
+    """
+    Returns a frozenset of devices available for the current PyTorch installation.
+    """
+    devices = {"cpu"}  # `cpu` is always supported as a device in PyTorch
+
+    if is_torch_cuda_available():
+        devices.add("cuda")
+
+    if is_torch_mps_available():
+        devices.add("mps")
+
+    if is_torch_xpu_available():
+        devices.add("xpu")
+
+    if is_torch_npu_available():
+        devices.add("npu")
+
+    if is_torch_mlu_available():
+        devices.add("mlu")
+
+    if is_torch_musa_available():
+        devices.add("musa")
+
+    return frozenset(devices)
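The combination of `@lru_cache()` and a `frozenset` return value in `get_available_devices` can be illustrated with a small torch-free sketch. The `PROBES` table and the `probe_calls` counter are hypothetical stand-ins for the `is_torch_*_available()` checks; they are not part of the library.

```python
from functools import lru_cache
from typing import Callable, Dict, FrozenSet

# Hypothetical probes standing in for is_torch_cuda_available() and friends.
PROBES: Dict[str, Callable[[], bool]] = {
    "cuda": lambda: False,
    "xpu": lambda: True,
}
probe_calls = 0  # counts how often the body actually runs


@lru_cache()
def available_devices() -> FrozenSet[str]:
    global probe_calls
    probe_calls += 1
    devices = {"cpu"}  # CPU is always present
    for name, probe in PROBES.items():
        if probe():
            devices.add(name)
    # An immutable result is safe to share between all callers of the cache.
    return frozenset(devices)


first = available_devices()
second = available_devices()
print(sorted(first), first is second, probe_calls)
```

Returning a `frozenset` matters here because the cached object is handed to every caller; a mutable `set` could be modified by one caller and silently change the result seen by the others.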
@@ -849,15 +849,29 @@ def is_torch_xpu_available(check_device=False):
     return hasattr(torch, "xpu") and torch.xpu.is_available()
 
 
+@lru_cache()
 def is_bitsandbytes_available():
-    if not is_torch_available():
+    if not is_torch_available() or not _bitsandbytes_available:
         return False
 
-    # bitsandbytes throws an error if cuda is not available
-    # let's avoid that by adding a simple check
     import torch
 
-    return _bitsandbytes_available and torch.cuda.is_available()
+    # `bitsandbytes` versions older than 0.43.1 eagerly require CUDA at import time,
+    # so those versions of the library are practically only available when CUDA is too.
+    if version.parse(importlib.metadata.version("bitsandbytes")) < version.parse("0.43.1"):
+        return torch.cuda.is_available()
+
+    # Newer versions of `bitsandbytes` can be imported on systems without CUDA.
+    return True
+
+
+def is_bitsandbytes_multi_backend_available() -> bool:
+    if not is_bitsandbytes_available():
+        return False
+
+    import bitsandbytes as bnb
+
+    return "multi_backend" in getattr(bnb, "features", set())
 
 
 def is_flash_attn_2_available():
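The two checks in this hunk, a version gate at 0.43.1 and a feature probe via `getattr(bnb, "features", set())`, can be exercised without installing `bitsandbytes`. In the sketch below, `parse_version` is a deliberately tiny stand-in for `packaging.version.parse` that only handles plain `X.Y.Z` strings, and the `SimpleNamespace` objects are hypothetical stubs for the imported module.

```python
from types import SimpleNamespace
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Tiny stand-in for packaging.version.parse; plain 'X.Y.Z' strings only."""
    return tuple(int(part) for part in text.split("."))


def importable_without_cuda(installed: str, minimum: str = "0.43.1") -> bool:
    # Numeric tuples compare correctly where raw strings would not
    # (as strings, "0.9.0" sorts after "0.43.1").
    return parse_version(installed) >= parse_version(minimum)


def has_multi_backend(module) -> bool:
    # getattr with a default keeps the probe safe on builds that
    # do not define a `features` attribute at all.
    return "multi_backend" in getattr(module, "features", set())


legacy_bnb = SimpleNamespace()  # stub: no `features` attribute
multi_bnb = SimpleNamespace(features={"multi_backend"})  # stub: feature advertised

for v in ("0.9.0", "0.42.0", "0.43.1", "0.44.0"):
    print(v, importable_without_cuda(v))
print(has_multi_backend(legacy_bnb), has_multi_backend(multi_bnb))
```

Probing for an advertised feature rather than a version number means the check keeps working if the multi-backend code ships under a different version scheme.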
@@ -30,12 +30,13 @@ from transformers import (
|
||||
pipeline,
|
||||
)
|
||||
from transformers.testing_utils import (
|
||||
apply_skip_if_not_implemented,
|
||||
is_bitsandbytes_available,
|
||||
is_torch_available,
|
||||
require_accelerate,
|
||||
require_bitsandbytes,
|
||||
require_torch,
|
||||
require_torch_gpu,
|
||||
require_torch_gpu_if_bnb_not_multi_backend_enabled,
|
||||
require_torch_multi_gpu,
|
||||
slow,
|
||||
torch_device,
|
||||
@@ -85,7 +86,7 @@ if is_bitsandbytes_available():
|
||||
@require_bitsandbytes
|
||||
@require_accelerate
|
||||
@require_torch
|
||||
@require_torch_gpu
|
||||
@require_torch_gpu_if_bnb_not_multi_backend_enabled
|
||||
@slow
|
||||
class Base4bitTest(unittest.TestCase):
|
||||
# We keep the constants inside the init function and model loading inside setUp function
|
||||
@@ -111,6 +112,7 @@ class Base4bitTest(unittest.TestCase):
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
|
||||
|
||||
|
||||
@apply_skip_if_not_implemented
|
||||
class Bnb4BitTest(Base4bitTest):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -206,7 +208,7 @@ class Bnb4BitTest(Base4bitTest):
|
||||
tok = AutoTokenizer.from_pretrained(model_id)
|
||||
|
||||
text = "Hello my name is"
|
||||
input_ids = tok.encode(text, return_tensors="pt").to(0)
|
||||
input_ids = tok.encode(text, return_tensors="pt").to(torch_device)
|
||||
|
||||
_ = model.generate(input_ids, max_new_tokens=30)
|
||||
|
||||
@@ -217,7 +219,9 @@ class Bnb4BitTest(Base4bitTest):
|
||||
the same output across GPUs. So we'll generate few tokens (5-10) and check their output.
|
||||
"""
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = self.model_4bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_sequences = self.model_4bit.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
@@ -234,7 +238,7 @@ class Bnb4BitTest(Base4bitTest):
|
||||
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = model_4bit_from_config.generate(
|
||||
input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
@@ -252,7 +256,9 @@ class Bnb4BitTest(Base4bitTest):
|
||||
model_4bit.dequantize()
|
||||
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = model_4bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_sequences = model_4bit.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
@@ -267,15 +273,18 @@ class Bnb4BitTest(Base4bitTest):
         self.assertEqual(self.model_4bit.device.type, "cpu")
         self.assertAlmostEqual(self.model_4bit.get_memory_footprint(), mem_before)
 
-        # Move back to CUDA device
-        self.model_4bit.to(0)
-        self.assertEqual(self.model_4bit.device, torch.device(0))
-        self.assertAlmostEqual(self.model_4bit.get_memory_footprint(), mem_before)
+        if torch.cuda.is_available():
+            # Move back to CUDA device
+            self.model_4bit.to("cuda")
+            self.assertEqual(self.model_4bit.device.type, "cuda")
+            self.assertAlmostEqual(self.model_4bit.get_memory_footprint(), mem_before)
 
     def test_device_and_dtype_assignment(self):
         r"""
-        Test whether trying to cast (or assigning a device to) a model after converting it in 4-bit will throw an error.
-        Checks also if other models are casted correctly.
+        Test whether attempting to change the device or cast the dtype of a model
+        after converting it to 4-bit precision will raise an appropriate error.
+        The test ensures that such operations are prohibited on 4-bit models
+        to prevent invalid conversions.
         """
 
         # Moving with `to` or `cuda` is not supported with versions < 0.43.2.
@@ -297,25 +306,24 @@ class Bnb4BitTest(Base4bitTest):
|
||||
self.model_4bit.to(torch.float16)
|
||||
|
||||
with self.assertRaises(ValueError):
|
||||
# Tries with a `dtype` and `device`
|
||||
self.model_4bit.to(device="cuda:0", dtype=torch.float16)
|
||||
|
||||
with self.assertRaises(ValueError):
|
||||
# Tries with a cast
|
||||
# Tries to cast the 4-bit model to float32 using `float()`
|
||||
self.model_4bit.float()
|
||||
|
||||
with self.assertRaises(ValueError):
|
||||
# Tries with a cast
|
||||
# Tries to cast the 4-bit model to float16 using `half()`
|
||||
self.model_4bit.half()
|
||||
|
||||
# Test if we did not break anything
|
||||
self.model_4bit.to(torch.device(torch_device))
|
||||
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
|
||||
self.model_fp16 = self.model_fp16.to(torch.float32)
|
||||
_ = self.model_fp16.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
_ = self.model_fp16.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
|
||||
|
||||
# Check that this does not throw an error
|
||||
_ = self.model_fp16.cuda()
|
||||
if torch.cuda.is_available():
|
||||
# Check that this does not throw an error
|
||||
_ = self.model_fp16.cuda()
|
||||
|
||||
# Check this does not throw an error
|
||||
_ = self.model_fp16.to("cpu")
|
||||
@@ -344,8 +352,9 @@ class Bnb4BitTest(Base4bitTest):
|
||||
@require_bitsandbytes
|
||||
@require_accelerate
|
||||
@require_torch
|
||||
@require_torch_gpu
|
||||
@require_torch_gpu_if_bnb_not_multi_backend_enabled
|
||||
@slow
|
||||
@apply_skip_if_not_implemented
|
||||
class Bnb4BitT5Test(unittest.TestCase):
|
||||
@classmethod
|
||||
def setUpClass(cls):
|
||||
@@ -375,14 +384,14 @@ class Bnb4BitT5Test(unittest.TestCase):
|
||||
|
||||
# test with `google-t5/t5-small`
|
||||
model = T5ForConditionalGeneration.from_pretrained(self.model_name, load_in_4bit=True, device_map="auto")
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
|
||||
# test with `flan-t5-small`
|
||||
model = T5ForConditionalGeneration.from_pretrained(
|
||||
self.dense_act_model_name, load_in_4bit=True, device_map="auto"
|
||||
)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
T5ForConditionalGeneration._keep_in_fp32_modules = modules
|
||||
|
||||
@@ -400,17 +409,18 @@ class Bnb4BitT5Test(unittest.TestCase):
|
||||
# there was a bug with decoders - this test checks that it is fixed
|
||||
self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear4bit))
|
||||
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
|
||||
# test with `flan-t5-small`
|
||||
model = T5ForConditionalGeneration.from_pretrained(
|
||||
self.dense_act_model_name, load_in_4bit=True, device_map="auto"
|
||||
)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
|
||||
|
||||
@apply_skip_if_not_implemented
|
||||
class Classes4BitModelTest(Base4bitTest):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -460,6 +470,7 @@ class Classes4BitModelTest(Base4bitTest):
|
||||
self.assertTrue(self.seq_to_seq_model.lm_head.weight.__class__ == torch.nn.Parameter)
|
||||
|
||||
|
||||
@apply_skip_if_not_implemented
|
||||
class Pipeline4BitTest(Base4bitTest):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -469,7 +480,8 @@ class Pipeline4BitTest(Base4bitTest):
         TearDown function needs to be called at the end of each test to free the GPU memory and cache, also to
         avoid unexpected behaviors. Please see: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/27
         """
-        del self.pipe
+        if hasattr(self, "pipe"):
+            del self.pipe
 
         gc.collect()
         torch.cuda.empty_cache()
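The `hasattr` guard in `tearDown` matters because `unittest` still runs `tearDown` when the test body is skipped or fails before `self.pipe` is assigned. A minimal torch-free sketch of that situation follows; `PipelineLikeTest` and its methods are hypothetical and only imitate the shape of the real test class.

```python
import gc
import io
import unittest


class PipelineLikeTest(unittest.TestCase):  # hypothetical stand-in
    def tearDown(self):
        # Without this guard, `del self.pipe` raises AttributeError whenever
        # the test ended before the attribute was created.
        if hasattr(self, "pipe"):
            del self.pipe
        gc.collect()

    def test_creates_pipe(self):
        self.pipe = object()

    def test_skipped_before_pipe_exists(self):
        raise unittest.SkipTest("backend lacks the op, so no pipe was created")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(PipelineLikeTest)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(f"run={result.testsRun} skipped={len(result.skipped)} errors={len(result.errors)}")
```

With the guard removed, the skipped test would additionally be reported as an error from `tearDown`, which hides the real reason the test did not run.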
@@ -484,7 +496,12 @@ class Pipeline4BitTest(Base4bitTest):
         self.pipe = pipeline(
             "text-generation",
             model=self.model_name,
-            model_kwargs={"device_map": "auto", "load_in_4bit": True, "torch_dtype": torch.float16},
+            model_kwargs={
+                "device_map": "auto",
+                "load_in_4bit": True,
+                # float16 isn't supported on CPU, use bfloat16 instead
+                "torch_dtype": torch.bfloat16 if torch_device == "cpu" else torch.float16,
+            },
             max_new_tokens=self.MAX_NEW_TOKENS,
         )
 
@@ -494,6 +511,7 @@ class Pipeline4BitTest(Base4bitTest):
|
||||
|
||||
|
||||
@require_torch_multi_gpu
|
||||
@apply_skip_if_not_implemented
|
||||
class Bnb4bitTestMultiGpu(Base4bitTest):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -515,10 +533,13 @@ class Bnb4bitTestMultiGpu(Base4bitTest):
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
|
||||
# Second real batch
|
||||
output_parallel = model_parallel.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_parallel = model_parallel.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
self.assertIn(self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
|
||||
@apply_skip_if_not_implemented
|
||||
class Bnb4BitTestTraining(Base4bitTest):
|
||||
def setUp(self):
|
||||
self.model_name = "facebook/opt-350m"
|
||||
@@ -531,7 +552,10 @@ class Bnb4BitTestTraining(Base4bitTest):
         # Step 1: freeze all parameters
         model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_4bit=True)
 
-        self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})
+        if torch.cuda.is_available():
+            self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})
+        else:
+            self.assertTrue(all(param.device.type == "cpu" for param in model.parameters()))
 
         for param in model.parameters():
             param.requires_grad = False  # freeze the model - train adapters later
@@ -547,10 +571,10 @@ class Bnb4BitTestTraining(Base4bitTest):
                 module.v_proj = LoRALayer(module.v_proj, rank=16)
 
         # Step 3: dummy batch
-        batch = self.tokenizer("Test batch ", return_tensors="pt").to(0)
+        batch = self.tokenizer("Test batch ", return_tensors="pt").to(torch_device)
 
         # Step 4: Check if the gradient is not None
-        with torch.cuda.amp.autocast():
+        with torch.autocast(torch_device):
             out = model.forward(**batch)
             out.logits.norm().backward()
 
@@ -562,6 +586,7 @@ class Bnb4BitTestTraining(Base4bitTest):
|
||||
self.assertTrue(module.weight.grad is None)
|
||||
|
||||
|
||||
@apply_skip_if_not_implemented
|
||||
class Bnb4BitGPT2Test(Bnb4BitTest):
|
||||
model_name = "openai-community/gpt2-xl"
|
||||
EXPECTED_RELATIVE_DIFFERENCE = 3.3191854854152187
|
||||
@@ -570,8 +595,9 @@ class Bnb4BitGPT2Test(Bnb4BitTest):
|
||||
@require_bitsandbytes
|
||||
@require_accelerate
|
||||
@require_torch
|
||||
@require_torch_gpu
|
||||
@require_torch_gpu_if_bnb_not_multi_backend_enabled
|
||||
@slow
|
||||
@apply_skip_if_not_implemented
|
||||
class BaseSerializationTest(unittest.TestCase):
|
||||
model_name = "facebook/opt-125m"
|
||||
input_text = "Mars colonists' favorite meals are"
|
||||
@@ -635,7 +661,9 @@ class BaseSerializationTest(unittest.TestCase):
                     d1[k].quant_state.as_dict().values(),
                 ):
                     if isinstance(v0, torch.Tensor):
-                        self.assertTrue(torch.equal(v0, v1.to(v0.device)))
+                        # The absmax will not be saved in the quant_state when using NF4 in CPU
+                        if v0.numel() != 0:
+                            self.assertTrue(torch.equal(v0, v1.to(v0.device)))
                     else:
                         self.assertTrue(v0 == v1)
 
@@ -659,6 +687,7 @@ class BaseSerializationTest(unittest.TestCase):
|
||||
)
|
||||
|
||||
|
||||
@apply_skip_if_not_implemented
|
||||
class ExtendedSerializationTest(BaseSerializationTest):
|
||||
"""
|
||||
tests more combinations of parameters
|
||||
@@ -706,8 +735,9 @@ class GPTSerializationTest(BaseSerializationTest):
|
||||
|
||||
@require_bitsandbytes
|
||||
@require_accelerate
|
||||
@require_torch_gpu
|
||||
@require_torch_gpu_if_bnb_not_multi_backend_enabled
|
||||
@slow
|
||||
@apply_skip_if_not_implemented
|
||||
class Bnb4BitTestBasicConfigTest(unittest.TestCase):
|
||||
def test_load_in_4_and_8_bit_fails(self):
|
||||
with self.assertRaisesRegex(ValueError, "load_in_4bit and load_in_8bit are both True"):
|
||||
|
||||
@@ -30,14 +30,17 @@ from transformers import (
|
||||
pipeline,
|
||||
)
|
||||
from transformers.testing_utils import (
|
||||
apply_skip_if_not_implemented,
|
||||
is_accelerate_available,
|
||||
is_bitsandbytes_available,
|
||||
is_torch_available,
|
||||
require_accelerate,
|
||||
require_bitsandbytes,
|
||||
require_torch,
|
||||
require_torch_gpu,
|
||||
require_torch_gpu_if_bnb_not_multi_backend_enabled,
|
||||
require_torch_multi_gpu,
|
||||
slow,
|
||||
torch_device,
|
||||
)
|
||||
|
||||
|
||||
@@ -77,10 +80,14 @@ if is_torch_available():
|
||||
return self.module(input, *args, **kwargs) + self.adapter(input)
|
||||
|
||||
|
||||
if is_bitsandbytes_available():
|
||||
import bitsandbytes as bnb
|
||||
|
||||
|
||||
@require_bitsandbytes
|
||||
@require_accelerate
|
||||
@require_torch
|
||||
@require_torch_gpu
|
||||
@require_torch_gpu_if_bnb_not_multi_backend_enabled
|
||||
@slow
|
||||
class BaseMixedInt8Test(unittest.TestCase):
|
||||
# We keep the constants inside the init function and model loading inside setUp function
|
||||
@@ -108,6 +115,7 @@ class BaseMixedInt8Test(unittest.TestCase):
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
|
||||
|
||||
|
||||
@apply_skip_if_not_implemented
|
||||
class MixedInt8Test(BaseMixedInt8Test):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -240,7 +248,6 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
r"""
|
||||
A simple test to check if `llm_int8_skip_modules` works as expected
|
||||
"""
|
||||
import bitsandbytes as bnb
|
||||
|
||||
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=["classifier"])
|
||||
seq_classification_model = AutoModelForSequenceClassification.from_pretrained(
|
||||
@@ -263,7 +270,9 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
the same output across GPUs. So we'll generate few tokens (5-10) and check their output.
|
||||
"""
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = self.model_8bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_sequences = self.model_8bit.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
@@ -280,7 +289,7 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = model_8bit_from_config.generate(
|
||||
input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
@@ -298,7 +307,9 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
model_8bit.dequantize()
|
||||
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = model_8bit.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_sequences = model_8bit.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
@@ -319,8 +330,10 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
|
||||
def test_device_and_dtype_assignment(self):
|
||||
r"""
|
||||
Test whether trying to cast (or assigning a device to) a model after converting it in 8-bit will throw an error.
|
||||
Checks also if other models are casted correctly.
|
||||
Test whether attempting to change the device or cast the dtype of a model
|
||||
after converting it to 8-bit precision will raise an appropriate error.
|
||||
The test ensures that such operations are prohibited on 8-bit models
|
||||
to prevent invalid conversions.
|
||||
"""
|
||||
with self.assertRaises(ValueError):
|
||||
# Tries with `str`
|
||||
@@ -332,21 +345,21 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
|
||||
with self.assertRaises(ValueError):
|
||||
# Tries with a `device`
|
||||
self.model_8bit.to(torch.device("cuda:0"))
|
||||
self.model_8bit.to(torch.device(torch_device))
|
||||
|
||||
with self.assertRaises(ValueError):
|
||||
# Tries with a `device`
|
||||
# Tries to cast the 8-bit model to float32 using `float()`
|
||||
self.model_8bit.float()
|
||||
|
||||
with self.assertRaises(ValueError):
|
||||
# Tries with a `device`
|
||||
# Tries to cast the 8-bit model to float16 using `half()`
|
||||
self.model_8bit.half()
|
||||
|
||||
# Test if we did not break anything
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
|
||||
self.model_fp16 = self.model_fp16.to(torch.float32)
|
||||
_ = self.model_fp16.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
_ = self.model_fp16.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
|
||||
|
||||
# Check this does not throw an error
|
||||
_ = self.model_fp16.to("cpu")
|
||||
@@ -385,7 +398,9 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
|
||||
# generate
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = model_from_saved.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_sequences = model_from_saved.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
@@ -410,7 +425,9 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
|
||||
# generate
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = model_from_saved.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_sequences = model_from_saved.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
@@ -435,7 +452,9 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
|
||||
# generate
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = model_from_saved.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_sequences = model_from_saved.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
@@ -455,7 +474,7 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
|
||||
# generate
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
|
||||
|
||||
self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
@@ -463,7 +482,7 @@ class MixedInt8Test(BaseMixedInt8Test):
|
||||
@require_bitsandbytes
|
||||
@require_accelerate
|
||||
@require_torch
|
||||
@require_torch_gpu
|
||||
@require_torch_gpu_if_bnb_not_multi_backend_enabled
|
||||
@slow
|
||||
class MixedInt8T5Test(unittest.TestCase):
|
||||
@classmethod
|
||||
@@ -494,14 +513,14 @@ class MixedInt8T5Test(unittest.TestCase):
|
||||
|
||||
# test with `google-t5/t5-small`
|
||||
model = T5ForConditionalGeneration.from_pretrained(self.model_name, load_in_8bit=True, device_map="auto")
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
|
||||
# test with `flan-t5-small`
|
||||
model = T5ForConditionalGeneration.from_pretrained(
|
||||
self.dense_act_model_name, load_in_8bit=True, device_map="auto"
|
||||
)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
T5ForConditionalGeneration._keep_in_fp32_modules = modules
|
||||
|
||||
@@ -511,7 +530,6 @@ class MixedInt8T5Test(unittest.TestCase):
|
||||
`flan-t5-small` uses `T5DenseGatedActDense` whereas `google-t5/t5-small` uses `T5DenseReluDense`. We need to test
|
||||
both cases.
|
||||
"""
|
||||
import bitsandbytes as bnb
|
||||
|
||||
from transformers import T5ForConditionalGeneration
|
||||
|
||||
@@ -521,14 +539,14 @@ class MixedInt8T5Test(unittest.TestCase):
|
||||
# there was a bug with decoders - this test checks that it is fixed
|
||||
self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear8bitLt))
|
||||
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
|
||||
# test with `flan-t5-small`
|
||||
model = T5ForConditionalGeneration.from_pretrained(
|
||||
self.dense_act_model_name, load_in_8bit=True, device_map="auto"
|
||||
)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
|
||||
def test_inference_with_keep_in_fp32_serialized(self):
|
||||
@@ -538,7 +556,6 @@ class MixedInt8T5Test(unittest.TestCase):
|
||||
`flan-t5-small` uses `T5DenseGatedActDense` whereas `google-t5/t5-small` uses `T5DenseReluDense`. We need to test
|
||||
both cases.
|
||||
"""
|
||||
import bitsandbytes as bnb
|
||||
|
||||
from transformers import T5ForConditionalGeneration
|
||||
|
||||
@@ -553,14 +570,14 @@ class MixedInt8T5Test(unittest.TestCase):
|
||||
# there was a bug with decoders - this test checks that it is fixed
|
||||
self.assertTrue(isinstance(model.decoder.block[0].layer[0].SelfAttention.q, bnb.nn.Linear8bitLt))
|
||||
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
|
||||
# test with `flan-t5-small`
|
||||
model = T5ForConditionalGeneration.from_pretrained(
|
||||
self.dense_act_model_name, load_in_8bit=True, device_map="auto"
|
||||
)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(0)
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt").to(torch_device)
|
||||
_ = model.generate(**encoded_input)
|
||||
|
||||
|
||||
@@ -614,6 +631,7 @@ class MixedInt8ModelClassesTest(BaseMixedInt8Test):
|
||||
self.assertTrue(self.seq_to_seq_model.lm_head.weight.__class__ == torch.nn.Parameter)
|
||||
|
||||
|
||||
@apply_skip_if_not_implemented
|
||||
class MixedInt8TestPipeline(BaseMixedInt8Test):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -623,7 +641,8 @@ class MixedInt8TestPipeline(BaseMixedInt8Test):
|
||||
TearDown function needs to be called at the end of each test to free the GPU memory and cache, also to
|
||||
avoid unexpected behaviors. Please see: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/27
|
||||
"""
|
||||
del self.pipe
|
||||
if hasattr(self, "pipe"):
|
||||
del self.pipe
|
||||
|
||||
gc.collect()
|
||||
torch.cuda.empty_cache()
|
||||
@@ -648,6 +667,7 @@ class MixedInt8TestPipeline(BaseMixedInt8Test):
|
||||
|
||||
|
||||
@require_torch_multi_gpu
|
||||
@apply_skip_if_not_implemented
|
||||
class MixedInt8TestMultiGpu(BaseMixedInt8Test):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -669,11 +689,14 @@ class MixedInt8TestMultiGpu(BaseMixedInt8Test):
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
|
||||
# Second real batch
|
||||
output_parallel = model_parallel.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_parallel = model_parallel.generate(
|
||||
input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10
|
||||
)
|
||||
self.assertIn(self.tokenizer.decode(output_parallel[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
|
||||
|
||||
|
||||
@require_torch_multi_gpu
|
||||
@apply_skip_if_not_implemented
|
||||
class MixedInt8TestCpuGpu(BaseMixedInt8Test):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
@@ -683,7 +706,7 @@ class MixedInt8TestCpuGpu(BaseMixedInt8Test):
|
||||
encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
|
||||
|
||||
# Check the exactness of the results
|
||||
output_parallel = model.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
|
||||
output_parallel = model.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)
|
||||
|
||||
# Get the generation
|
||||
output_text = self.tokenizer.decode(output_parallel[0], skip_special_tokens=True)
|
||||
@@ -819,6 +842,7 @@ class MixedInt8TestCpuGpu(BaseMixedInt8Test):
             self.check_inference_correctness(model_8bit)


+@apply_skip_if_not_implemented
 class MixedInt8TestTraining(BaseMixedInt8Test):
     def setUp(self):
         self.model_name = "facebook/opt-350m"
@@ -831,7 +855,10 @@ class MixedInt8TestTraining(BaseMixedInt8Test):
         # Step 1: freeze all parameters
         model = AutoModelForCausalLM.from_pretrained(self.model_name, load_in_8bit=True)

-        self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})
+        if torch.cuda.is_available():
+            self.assertEqual(set(model.hf_device_map.values()), {torch.cuda.current_device()})
+        else:
+            self.assertTrue(all(param.device.type == "cpu" for param in model.parameters()))

         for param in model.parameters():
             param.requires_grad = False  # freeze the model - train adapters later
@@ -847,10 +874,10 @@ class MixedInt8TestTraining(BaseMixedInt8Test):
                 module.v_proj = LoRALayer(module.v_proj, rank=16)

         # Step 3: dummy batch
-        batch = self.tokenizer("Test batch ", return_tensors="pt").to(0)
+        batch = self.tokenizer("Test batch ", return_tensors="pt").to(torch_device)

         # Step 4: Check if the gradient is not None
-        with torch.cuda.amp.autocast():
+        with torch.autocast(torch_device):
             out = model.forward(**batch)
             out.logits.norm().backward()

@@ -862,6 +889,7 @@ class MixedInt8TestTraining(BaseMixedInt8Test):
                 self.assertTrue(module.weight.grad is None)


+@apply_skip_if_not_implemented
 class MixedInt8GPT2Test(MixedInt8Test):
     model_name = "openai-community/gpt2-xl"
     EXPECTED_RELATIVE_DIFFERENCE = 1.8720077507258357
@@ -870,6 +898,9 @@ class MixedInt8GPT2Test(MixedInt8Test):
     EXPECTED_OUTPUTS.add("Hello my name is John Doe, and I'm a fan of the")
     # Expected values on a A10
     EXPECTED_OUTPUTS.add("Hello my name is John Doe, and I am a member of the")
+    # Expected values on Intel CPU
+    EXPECTED_OUTPUTS.add("Hello my name is John Doe. I am a man. I am")
+    EXPECTED_OUTPUTS.add("Hello my name is John, and I'm a writer. I'm")

     def test_int8_from_pretrained(self):
         r"""
@@ -887,6 +918,6 @@ class MixedInt8GPT2Test(MixedInt8Test):

         # generate
         encoded_input = self.tokenizer(self.input_text, return_tensors="pt")
-        output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(0), max_new_tokens=10)
+        output_sequences = model.generate(input_ids=encoded_input["input_ids"].to(torch_device), max_new_tokens=10)

         self.assertIn(self.tokenizer.decode(output_sequences[0], skip_special_tokens=True), self.EXPECTED_OUTPUTS)
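Note on the `tearDown` change in this diff: the commit message describes it as avoiding a case where a test "reports attr err due to other failure". The sketch below illustrates that pattern using only the standard library. The class `PipelineLikeTest`, its two test methods, and `run_suite` are hypothetical names invented for illustration and are not part of the transformers test suite; the real `tearDown` also calls `torch.cuda.empty_cache()`, which is omitted here so the sketch runs without PyTorch.

```python
import gc
import io
import unittest


class PipelineLikeTest(unittest.TestCase):
    """Hypothetical test case; `pipe` stands in for a heavy pipeline object."""

    def tearDown(self):
        # Delete the attribute only if the test got far enough to create it.
        # An unconditional `del self.pipe` raises AttributeError when the test
        # never set it, which can mask the original failure.
        if hasattr(self, "pipe"):
            del self.pipe
        gc.collect()

    def test_creates_pipe(self):
        self.pipe = object()
        self.assertIsNotNone(self.pipe)

    def test_never_creates_pipe(self):
        # `self.pipe` is never assigned here; tearDown must still succeed.
        self.assertFalse(hasattr(self, "pipe"))


def run_suite() -> unittest.TestResult:
    """Run the hypothetical test case quietly and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PipelineLikeTest)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)
```

Calling `run_suite()` runs both tests; with the `hasattr` guard in place, neither should report an error from `tearDown`.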