Commit 4e27a40

wejoncy and YangWang92 authored
FEAT : Adding VPTQ quantization method to HFQuantizer (#34770)
* init vptq
* add integration
* add vptq support fix readme
* add tests && format
* format
* address comments
* format
* format
* address comments
* format
* address comments
* remove debug code
* Revert "remove debug code" This reverts commit ed3b3ea.
* fix test

---------

Co-authored-by: Yang Wang <wyatuestc@gmail.com>
1 parent 5a2aedc commit 4e27a40

21 files changed

Lines changed: 647 additions & 3 deletions

docker/transformers-quantization-latest-gpu/Dockerfile

Lines changed: 3 additions & 0 deletions
@@ -50,6 +50,9 @@ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/pef
 # Add aqlm for quantization testing
 RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2

+# Add vptq for quantization testing
+RUN python3 -m pip install --no-cache-dir vptq
+
 # Add hqq for quantization testing
 RUN python3 -m pip install --no-cache-dir hqq
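The image installs `vptq` so that the quantization tests can run. Test suites commonly skip such tests when the package is absent; a minimal sketch of that kind of guard, using only the standard library, might look like the following. The helper name `package_available` is hypothetical and is not the check Transformers itself uses.

```python
import importlib.util


def package_available(name: str) -> bool:
    """Return True if the top-level package `name` can be found for import."""
    # find_spec returns None for a missing top-level package and does not import it.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Hypothetical usage: decide whether VPTQ-dependent tests should run.
    print("vptq installed:", package_available("vptq"))
```

The check only tells you the package is discoverable; it does not verify the installed version or that a GPU build is usable.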

docs/source/ar/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -157,6 +157,8 @@
 # title: AWQ
 # - local: quantization/aqlm
 # title: AQLM
+# - local: quantization/vptq
+# title: VPTQ
 # - local: quantization/quanto
 # title: Quanto
 # - local: quantization/eetq

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -167,6 +167,8 @@
 title: AWQ
 - local: quantization/aqlm
 title: AQLM
+- local: quantization/vptq
+title: VPTQ
 - local: quantization/quanto
 title: Quanto
 - local: quantization/eetq

docs/source/en/llm_optims.md

Lines changed: 1 addition & 1 deletion
@@ -473,7 +473,7 @@ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable
 Quantization reduces the size of the LLM weights by storing them in a lower precision. This translates to lower memory usage and makes loading LLMs for inference more accessible if you're constrained by your GPUs memory. If you aren't limited by your GPU, you don't necessarily need to quantize your model because it can incur a small latency cost (except for AWQ and fused AWQ modules) due to the extra step required to quantize and dequantize the weights.

 > [!TIP]
-> There are many quantization libraries (see the [Quantization](./quantization) guide for more details) available, such as Quanto, AQLM, AWQ, and AutoGPTQ. Feel free to try them out and see which one works best for your use case. We also recommend reading the [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) blog post which compares AutoGPTQ and bitsandbytes.
+> There are many quantization libraries (see the [Quantization](./quantization) guide for more details) available, such as Quanto, AQLM, VPTQ, AWQ, and AutoGPTQ. Feel free to try them out and see which one works best for your use case. We also recommend reading the [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) blog post which compares AutoGPTQ and bitsandbytes.

 Use the Model Memory Calculator below to estimate and compare how much memory is required to load a model. For example, try estimating how much memory it costs to load [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
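As a rough illustration of the memory claim in the changed paragraph, weight storage scales linearly with bits per parameter. The sketch below counts weights only (no activations, KV cache, or quantization metadata such as codebooks or scales), and the 7e9 parameter count is a round-number assumption rather than a figure taken from the docs.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 7e9  # assumed round parameter count for a "7B" model
    for bits in (16, 8, 4, 2):
        print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):.2f} GB")
```

Real quantized checkpoints will be somewhat larger than this estimate because of the metadata each method stores alongside the weights.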

docs/source/en/main_classes/quantization.md

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,10 @@ Learn how to quantize models in the [Quantization](../quantization) guide.

 [[autodoc]] AqlmConfig

+## VptqConfig
+
+[[autodoc]] VptqConfig
+
 ## AwqConfig

 [[autodoc]] AwqConfig

docs/source/en/quantization/overview.md

Lines changed: 2 additions & 1 deletion
@@ -58,6 +58,7 @@ Use the table below to help you decide which quantization method to use.
 | [optimum-quanto](./quanto) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 2 / 4 / 8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/optimum-quanto |
 | [FBGEMM_FP8](./fbgemm_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
 | [torchao](./torchao.md) | 🟢 | | 🟢 | 🔴 | partial support (int4 weight only) | 🔴 | | 4 / 8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
+| [VPTQ](./vptq) | 🔴 | 🔴 | 🟢 | 🟡 | 🔴 | 🔴 | 🟢 | 1 - 8 | 🔴 | 🟢 | 🟢 | https://github.com/microsoft/VPTQ |

 <Tip>

@@ -71,4 +72,4 @@ We value your feedback to help identify bugs before the full release! Check out

 \** bitsandbytes is seeking contributors to help develop and lead the Apple Silicon backend. Interested? Contact them directly via their repo. Stipends may be available through sponsorships.

-</Tip>
+</Tip>

docs/source/en/quantization/vptq.md

Lines changed: 111 additions & 0 deletions
Large diffs are not rendered by default.

docs/source/ko/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -151,6 +151,8 @@
 title: AWQ
 - local: in_translation
 title: (번역중) AQLM
+- local: in_translation
+title: (번역중) VPTQ
 - local: in_translation
 title: (번역중) Quanto
 - local: in_translation
@@ -173,6 +175,8 @@
 title: (번역중) AWQ
 - local: in_translation
 title: (번역중) AQLM
+- local: in_translation
+title: (번역중) VPTQ
 - local: quantization/quanto
 title: Quanto
 - local: quantization/eetq

docs/source/ko/llm_optims.md

Lines changed: 1 addition & 1 deletion
@@ -375,7 +375,7 @@ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable
 양자화는 LLM 가중치를 더 낮은 정밀도로 저장하여 크기를 줄입니다. 이는 메모리 사용량을 줄이며 GPU 메모리에 제약이 있는 경우 추론을 위해 LLM을 로드하는 것을 더 용이하게 합니다. GPU가 충분하다면, 모델을 양자화할 필요는 없습니다. 추가적인 양자화 및 양자화 해제 단계로 인해 약간의 지연이 발생할 수 있기 때문입니다(AWQ 및 융합 AWQ 모듈 제외).

 > [!TIP]
-> 다양한 양자화 라이브러리(자세한 내용은 [Quantization](./quantization) 가이드를 참조하십시오)가 있습니다. 여기에는 Quanto, AQLM, AWQ 및 AutoGPTQ가 포함됩니다. 사용 사례에 가장 잘 맞는 라이브러리를 사용해 보십시오. 또한 AutoGPTQ와 bitsandbytes를 비교하는 [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) 블로그 게시물을 읽어보는 것을 추천합니다.
+> 다양한 양자화 라이브러리(자세한 내용은 [Quantization](./quantization) 가이드를 참조하십시오)가 있습니다. 여기에는 Quanto, AQLM, VPTQ, AWQ 및 AutoGPTQ가 포함됩니다. 사용 사례에 가장 잘 맞는 라이브러리를 사용해 보십시오. 또한 AutoGPTQ와 bitsandbytes를 비교하는 [Overview of natively supported quantization schemes in 🤗 Transformers](https://hf.co/blog/overview-quantization-transformers) 블로그 게시물을 읽어보는 것을 추천합니다.

 아래의 모델 메모리 계산기를 사용하여 모델을 로드하는 데 필요한 메모리를 추정하고 비교해 보십시오. 예를 들어 [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)를 로드하는 데 필요한 메모리를 추정해 보십시오.

docs/source/ko/main_classes/quantization.md

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,10 @@ Transformers에서 지원되지 않는 양자화 기법들은 [`HfQuantizer`]

 [[autodoc]] AqlmConfig

+## VptqConfig[[transformers.VptqConfig]]
+
+[[autodoc]] VptqConfig
+
 ## AwqConfig[[transformers.AwqConfig]]

 [[autodoc]] AwqConfig
