add 16a4w_hqq quant mode #3752
base: gh/cccclai/9/base
Conversation
Prerequisite: install hqq following https://github.com/mobiusml/hqq
Step 1: use hqq to quantize the weights to 4-bit
Step 2: use static quantization to quantize the activations to 16-bit
Currently the graph calibration is too slow, so the quant observers are added to the eager model for faster iteration.
command:
```
python -m examples.models.llama2.eval_llama -t /data/users/chenlai/models/llama2/tokenizer.model -p /data/users/chenlai/models/llama2/params.json -c /data/users/chenlai/models/llama2/consolidated.00.pth --max_seq_len 129 -qmode 16a4w-hqq --limit 5 2>&1 | tee hqq_16a4w.log
```
Differential Revision: [D57849772](https://our.internmc.facebook.com/intern/diff/D57849772/)
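To make the two steps concrete, here is a minimal sketch — not the implementation in this PR. The hqq import and constructor signature are assumptions based on the hqq README and may differ between versions, and `Int16ActObserver` is a hypothetical stand-in for whatever observer the eager model actually attaches.

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear  # assumed hqq API

# Step 1: replace every nn.Linear with an hqq 4-bit quantized linear.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

def quantize_weights_to_4bit(module: nn.Module) -> nn.Module:
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, HQQLinear(child, quant_config, compute_dtype=torch.float32))
        else:
            quantize_weights_to_4bit(child)
    return module

# Step 2: record activation ranges in eager mode so activations can later be
# statically quantized to 16 bit.
class Int16ActObserver(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pass-through during calibration; only the running min/max is updated.
        self.min_val = torch.minimum(self.min_val, x.detach().min())
        self.max_val = torch.maximum(self.max_val, x.detach().max())
        return x

    def qparams(self):
        # Symmetric 16-bit quantization: map the observed range onto [-32767, 32767].
        scale = torch.maximum(self.max_val.abs(), self.min_val.abs()) / 32767.0
        return scale, 0  # (scale, zero_point)
```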
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3752
Note: links to docs will display an error until the docs builds have completed. ❌ 1 new failure as of commit 9d92858 with merge base c665c17.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D57849772
Needs to install the latest hqq and torchao-nightly (see the version reference).
Also, I'll keep updating this PR; this commit, 5900db7, is the tested working one.
@@ -0,0 +1,205 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
maybe you want to move this to torchao: https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq?
Hi Chen,
Really appreciate you sharing this!
It should work with KV cache mode. It was just too slow when I started with hqq, and I was trying to get some initial signal on how the algorithm performs. PR #3732 helped address perf for the KV cache version a bit.
Oh, I think this line only predicts the next token given the prompt, so it's equivalent to the version without the KV cache.
How long does it take for you to apply this algorithm to stories? Just so we can use it for reference. When I apply this quant mode to llama2 on CPU, it takes two hours to finish; on GPU, it takes a few minutes. Also, if calibration takes too long, maybe reduce the number of samples to 5; I set it to 30, but 5 can be quite reasonable. Also, just a reminder, we need to set
Yes, from my observation too.
Ahh, I observed that when I switched to the version without the KV cache, running eval went from taking 20 minutes to 30 seconds. Amazing!
Unfortunately, when I run it on GPU, it OOMs, because my GPU, an RTX 3080, only has 10 GB of VRAM.
Thanks for sharing. I will give it a shot on my model.
Update on my experiment: I set group_size to None, and the PPL really goes up. Our llama with hqq 16a4w:
Baseline:
Thanks! Yeah, that's aligned with the observation on my side. How long does it take to run? Also, I expect this model can generate reasonable responses, though it will likely tend toward stuttering.
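For reference, a tiny self-contained illustration (not the PR's quantizer; the rounding scheme is a simplification) of why group_size matters for 4-bit weights: with group_size=None a single scale covers an entire weight row, so one outlier inflates the quantization step for everything in that row, while smaller groups give each chunk its own scale.

```python
import torch

def fake_quant_4bit(w: torch.Tensor, group_size=None) -> torch.Tensor:
    # Symmetric 4-bit fake quantization with one scale per group (simplified).
    out_features, in_features = w.shape
    gs = in_features if group_size is None else group_size
    wg = w.reshape(out_features, in_features // gs, gs)
    scale = (wg.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(wg / scale), -7, 7)
    return (q * scale).reshape(out_features, in_features)

torch.manual_seed(0)
w = torch.randn(8, 128)
w[0, 0] = 20.0  # a single outlier in one row
err_per_row = (fake_quant_4bit(w, None) - w).abs().mean()
err_grouped = (fake_quant_4bit(w, 64) - w).abs().mean()
print(f"per-row scales: {err_per_row:.4f}  group_size=64: {err_grouped:.4f}")
```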
When I update to this version, it takes about 2 hours on CPU. Regarding the issue with the mutable buffer: when we use your version, a lot of partitions are generated because QNN does not support the slice_scatter op. In our version, there are no partitions. The difference is that we update the KV cache outside of llama, while you update the KV cache in each attention layer. We are thinking about whether we can use some transforms from your version to update the KV cache at the end, so that we can do quantization research on the same llama.
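To illustrate the difference being discussed — this is not the actual code on either side, and the function names and shapes are made up for the example — an in-place update inside each attention layer exports to mutation ops like slice_scatter / index_put (one per layer, breaking QNN partitions), while a functional concat-style update managed outside the model keeps those ops out of the exported graph.

```python
import torch

def update_cache_inplace(k_cache, v_cache, k_new, v_new, pos: int):
    # In-place update inside each attention layer: after export this lowers to
    # mutation ops such as slice_scatter / index_put, which the QNN backend
    # does not handle, so the graph splits into extra partitions.
    seq = k_new.shape[2]
    k_cache[:, :, pos:pos + seq] = k_new
    v_cache[:, :, pos:pos + seq] = v_new
    return k_cache, v_cache

def update_cache_concat(k_cache, v_cache, k_new, v_new):
    # Functional update managed outside the model: concatenate the new step
    # onto the cache and return the grown tensors. No in-place mutation, so the
    # exported graph only contains cat ops and can stay in a single partition.
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)
    return k_cache, v_cache

# Hypothetical shapes: (batch, n_heads, seq, head_dim)
k_cache = torch.zeros(1, 8, 16, 64)
v_cache = torch.zeros(1, 8, 16, 64)
k_new = torch.randn(1, 8, 1, 64)
v_new = torch.randn(1, 8, 1, 64)
update_cache_inplace(k_cache, v_cache, k_new, v_new, pos=3)
```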
Actually, can you try #3786? It will get rid of
Great! I have tried it, but we still get {num_layers} partitions due to the index_put op. I need to figure out how to support it in QNN. But I could get one partition when I use concat to update the KV cache. BTW, for our llama with hqq 16a4w, I get the following result:
Yeah, a 60+ perplexity number is not great... I expect it to say some readable words, but the quality won't be great. Using concat sounds fine, as long as we have one partition. Regarding
I expect to use
I can check what spin quant outputs given the prompt.
Stack from ghstack (oldest at bottom):
Prerequisite: install hqq following https://github.com/mobiusml/hqq
Step 1: use hqq to quantize the weights to 4-bit
Step 2: use static quantization to quantize the activations to 16-bit
Currently the graph calibration is too slow, so the quant observers are added to the eager model for faster iteration.
command:
Differential Revision: D57849772