add exllamav2 arg (#26437)

* add exllamav2 arg
* add test
* style
* add check
* add doc
* replace by use_exllama_v2
* fix tests
* fix doc
* style
* better condition
* fix logic
* add deprecate msg
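
For illustration, the "add check" step, combined with the 4-bit restriction stated in the documentation change in this commit, might look roughly like the sketch below. `GPTQSettings` is a hypothetical stand-in and not the real `GPTQConfig` class; the exact condition and error message used in this commit are assumptions.

```python
from dataclasses import dataclass


@dataclass
class GPTQSettings:
    """Hypothetical stand-in for a GPTQ config; not the transformers class."""

    bits: int = 4
    use_exllama_v2: bool = False

    def __post_init__(self):
        # Assumption: reject the exllamav2 kernels for anything but 4-bit,
        # mirroring the "only 4-bit models are supported" note in the docs.
        if self.use_exllama_v2 and self.bits != 4:
            raise ValueError(
                f"exllamav2 kernels require bits=4, got bits={self.bits}"
            )
```

Validating at construction time surfaces an unsupported combination before any model weights are loaded.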
2023-10-26 16:15:05 +02:00
parent d7cb5e138e
commit 8214d6e7b1
4 changed files with 93 additions and 5 deletions
--- a/docs/source/en/main_classes/quantization.md
+++ b/docs/source/en/main_classes/quantization.md
@@ -128,12 +128,22 @@ For 4-bit model, you can use the exllama kernels in order to a faster inference

 ```py
 import torch
-gptq_config = GPTQConfig(bits=4, disable_exllama=False)
+gptq_config = GPTQConfig(bits=4)
+model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config = gptq_config)
+```
+
+With the release of the exllamav2 kernels, you can get faster inference than with the exllama kernels. To use them,
+pass `use_exllama_v2=True` in [`GPTQConfig`] and disable exllama kernels:
+
+```py
+import torch
+gptq_config = GPTQConfig(bits=4, use_exllama_v2=True)
 model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config = gptq_config)
 ```

 Note that only 4-bit models are supported for now. Furthermore, it is recommended to deactivate the exllama kernels if you are finetuning a quantized model with peft. 

+You can find benchmarks for these kernels [here](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark).
 #### Fine-tune a quantized model 

 With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.