add exllamav2 arg (#26437)

* add_ xllamav2 arg

* add test

* style

* add check

* add doc

* replace by use_exllama_v2

* fix tests

* fix doc

* style

* better condition

* fix logic

* add deprecate msg
Marc Sun
2023-10-26 16:15:05 +02:00
committed by GitHub
parent d7cb5e138e
commit 8214d6e7b1
4 changed files with 93 additions and 5 deletions

@@ -128,12 +128,22 @@ For 4-bit model, you can use the exllama kernels in order to a faster inference
```py
import torch
gptq_config = GPTQConfig(bits=4, disable_exllama=False)
gptq_config = GPTQConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config = gptq_config)
```
The exllamav2 kernels give faster inference than the exllama kernels. To use them,
pass `use_exllama_v2=True` to [`GPTQConfig`] and disable the exllama kernels:
```py
import torch
from transformers import AutoModelForCausalLM, GPTQConfig
gptq_config = GPTQConfig(bits=4, use_exllama_v2=True)
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config = gptq_config)
```
Note that only 4-bit models are supported for now. Furthermore, it is recommended to deactivate the exllama kernels if you are fine-tuning a quantized model with PEFT.
You can find the benchmark of these kernels [here](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark).
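The constraints in the note above can be sketched as a small stand-alone check. This is an illustration only, not the logic added to `GPTQConfig` by this commit: `pick_gptq_kernel`, its `finetuning` argument, and the returned labels are hypothetical names, and the sketch assumes the fine-tuning recommendation applies to both kernel versions.

```py
def pick_gptq_kernel(bits: int, use_exllama_v2: bool = False, finetuning: bool = False) -> str:
    """Return a label for the kernel family to use (hypothetical helper).

    Encodes only what the note states: the exllama kernels target 4-bit
    models, and deactivating them is recommended when fine-tuning.
    """
    if use_exllama_v2 and bits != 4:
        # Only 4-bit models are supported for now.
        raise ValueError(f"exllamav2 kernels need bits=4, got bits={bits}")
    if finetuning:
        # Assumption: the recommendation covers exllama and exllamav2 alike.
        return "default"
    if bits != 4:
        # Non 4-bit models fall back to the regular kernels.
        return "default"
    return "exllamav2" if use_exllama_v2 else "exllama"


print(pick_gptq_kernel(4, use_exllama_v2=True))
print(pick_gptq_kernel(4, use_exllama_v2=True, finetuning=True))
```

In real code the same decision would be expressed through the arguments passed to [`GPTQConfig`]; the helper only makes the rules explicit.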
#### Fine-tune a quantized model
With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.