FEAT : Adding VPTQ quantization method to HFQuantizer (#34770)

* init vptq
* add integration
* add vptq support fix readme
* add tests && format
* format
* address comments
* format
* format
* address comments
* format
* address comments
* remove debug code
* Revert "remove debug code"

  This reverts commit ed3b3eaaba82caf58cb3aa6e865d98e49650cf66.

* fix test

---------

Co-authored-by: Yang Wang <wyatuestc@gmail.com>
2024-12-20 16:45:53 +08:00
parent 5a2aedca1e
commit 4e27a4009d
21 changed files with 647 additions and 3 deletions
--- a/docs/source/ko/_toctree.yml
+++ b/docs/source/ko/_toctree.yml
@@ -151,6 +151,8 @@
    title: AWQ
  - local: in_translation
    title: (번역중) AQLM
+  - local: in_translation
+    title: (번역중) VPTQ 
  - local: in_translation
    title: (번역중) Quanto
  - local: in_translation
@@ -173,6 +175,8 @@
    title: (번역중) AWQ
  - local: in_translation
    title: (번역중) AQLM
+  - local: in_translation
+    title: (번역중) VPTQ
  - local: quantization/quanto
    title: Quanto
  - local: quantization/eetq
--- a/docs/source/ko/llm_optims.md
+++ b/docs/source/ko/llm_optims.md
@@ -375,7 +375,7 @@ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable
 양자화는 LLM 가중치를 더 낮은 정밀도로 저장하여 크기를 줄입니다. 이는 메모리 사용량을 줄이며 GPU 메모리에 제약이 있는 경우 추론을 위해 LLM을 로드하는 것을 더 용이하게 합니다. GPU가 충분하다면, 모델을 양자화할 필요는 없습니다. 추가적인 양자화 및 양자화 해제 단계로 인해 약간의 지연이 발생할 수 있기 때문입니다(AWQ 및 융합 AWQ 모듈 제외).

 > [!TIP]
-> 다양한 양자화 라이브러리(자세한 내용은 [Quantization](./quantization) 가이드를 참조하십시오)가 있습니다. 여기에는 Quanto, AQLM, AWQ 및 AutoGPTQ가 포함됩니다. 사용 사례에 가장 잘 맞는 라이브러리를 사용해 보십시오. 또한 AutoGPTQ와 bitsandbytes를 비교하는 [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) 블로그 게시물을 읽어보는 것을 추천합니다.
+> 다양한 양자화 라이브러리(자세한 내용은 [Quantization](./quantization) 가이드를 참조하십시오)가 있습니다. 여기에는 Quanto, AQLM, VPTQ, AWQ 및 AutoGPTQ가 포함됩니다. 사용 사례에 가장 잘 맞는 라이브러리를 사용해 보십시오. 또한 AutoGPTQ와 bitsandbytes를 비교하는 [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) 블로그 게시물을 읽어보는 것을 추천합니다.

 아래의 모델 메모리 계산기를 사용하여 모델을 로드하는 데 필요한 메모리를 추정하고 비교해 보십시오. 예를 들어 [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)를 로드하는 데 필요한 메모리를 추정해 보십시오.

--- a/docs/source/ko/main_classes/quantization.md
+++ b/docs/source/ko/main_classes/quantization.md
@@ -35,6 +35,10 @@ Transformers에서 지원되지 않는 양자화 기법들은 [`HfQuantizer`]

 [[autodoc]] AqlmConfig

+## VptqConfig[[transformers.VptqConfig]]
+
+[[autodoc]] VptqConfig
+
 ## AwqConfig[[transformers.AwqConfig]]

 [[autodoc]] AwqConfig