[Docs / Awq] Add docs on exllamav2 + AWQ (#29474)

* add docs on exllamav2 + AWQ
* Update docs/source/en/quantization.md
2024-03-06 06:30:47 +01:00
parent 00bf44270f
commit 2a002d073a
1 changed files with 39 additions and 0 deletions
--- a/docs/source/en/quantization.md
+++ b/docs/source/en/quantization.md
@@ -196,6 +196,45 @@ The parameter `modules_to_fuse` should include:
 </hfoption>
 </hfoptions>
 ### Exllama-v2 support
 Recent versions of `autoawq` support exllama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of `autoawq` by running:
 ```bash
 pip install git+https://github.com/casper-hansen/AutoAWQ.git
 ```
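To confirm the install before loading a model, a minimal sketch like the one below can report which version is present. It assumes the distribution is registered under the name `autoawq`; the helper `installed_version` is a hypothetical name used only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "autoawq" is assumed to be the distribution name used by the install above.
    print(installed_version("autoawq"))
```

A result of `None` suggests the package is not visible to the current Python environment.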
 Then enable the kernels by passing an `AwqConfig` with `version="exllama"` when loading the model.
 ```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
 # Select the exllama-v2 kernels for the AWQ-quantized layers.
 quantization_config = AwqConfig(version="exllama")
 model = AutoModelForCausalLM.from_pretrained(
     "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
     quantization_config=quantization_config,
     device_map="auto",
 )
 # Run a forward pass on random token IDs to check that the kernels work.
 input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
 output = model(input_ids)
 print(output.logits)
 # Generate text from a real prompt.
 tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
 input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
 # Use the tokenizer's own EOS token as the padding ID during generation.
 output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=tokenizer.eos_token_id)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
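To see whether the kernels speed up a given workload, one option is to time the forward pass with and without `version="exllama"`. The sketch below is a generic timing helper, not part of `transformers` or `autoawq`; the name `mean_latency` and its parameters are hypothetical. GPU work is asynchronous, so the optional `sync` callable is there to let the caller wait for pending kernels before reading the clock.

```python
import time
from typing import Callable, Optional


def mean_latency(
    fn: Callable[[], object],
    warmup: int = 1,
    repeats: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the mean wall-clock time in seconds of `fn` over `repeats` calls."""
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # wait for pending asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for the timed calls to finish before stopping the clock
    return (time.perf_counter() - start) / repeats
```

With the model loaded as above, a call such as `mean_latency(lambda: model(input_ids), sync=torch.cuda.synchronize)` would time the 128-token forward pass.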
 <Tip warning={true}>
 Note that this feature is supported on AMD GPUs.
 </Tip>
 ## AutoGPTQ
 <Tip>