Qualcomm AI Engine Direct - Add 4-bit Embedding Quantization Option #7691
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7691
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures)
As of commit e530130 with merge base a5c7609:
BROKEN TRUNK - The following jobs failed but were present on the merge base.
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai, this PR adds support for 4-bit embedding on CPU for static llama and refactors the passes in capture_program. Thanks.
Force-pushed from 1de8155 to aec4849
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Looks good - probably needs a rebase because #7618 is merged.
    return passes


def _topological_sort_passes(passes: OrderedDict):
Did you run into any issue because the pass order isn't correct?
No, this function is to prevent users from unexpectedly modifying the order of passes in scripts.
I also think it adds flexibility for users who want to introduce their own pass implementations.
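For context, a minimal sketch of what a topological sort over the pass dict could look like, assuming each entry carries an illustrative `run_after` list of pass names. The value layout and the `run_after` field are assumptions for illustration, not the actual capture_program metadata:

```python
from collections import OrderedDict, deque

def _topological_sort_passes_sketch(passes: OrderedDict):
    """Sketch only. `passes` maps a pass name to (pass_instance, run_after),
    where run_after is an assumed list of pass names that must run first."""
    indegree = {name: 0 for name in passes}
    dependents = {name: [] for name in passes}
    for name, (_, run_after) in passes.items():
        for dep in run_after:
            indegree[name] += 1
            dependents[dep].append(name)

    # Kahn's algorithm; the initial queue follows the OrderedDict insertion
    # order, so the user-supplied ordering is kept among unconstrained passes.
    queue = deque(name for name, deg in indegree.items() if deg == 0)
    ordered = OrderedDict()
    while queue:
        name = queue.popleft()
        ordered[name] = passes[name][0]
        for nxt in dependents[name]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(ordered) != len(passes):
        raise RuntimeError("Cycle detected in pass dependencies")
    return ordered
```

With a scheme like this, users can still register their own passes; the sort only reorders them when an explicit dependency requires it.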
Force-pushed from aec4849 to 21c2250
Hi @cccclai,
Force-pushed from 7caa0d0 to a81ea21
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:
- Introduce 4-bit embedding quantization for prefill, kv, and hybrid mode
- Fix an assertion condition bug in the annotate_and_quant_scalar pass
- Refactor passes in capture_program
- Add topological sorting for passes in capture_program
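To illustrate the first bullet, here is a minimal sketch of group-wise 4-bit embedding quantization. The group size of 32 and the symmetric int4 scheme are assumptions for illustration, not necessarily the scheme this PR implements:

```python
import torch

# Sketch only: group size and symmetric int4 mapping are assumptions.
def quantize_embedding_4bit(weight: torch.Tensor, group_size: int = 32):
    vocab, dim = weight.shape            # dim assumed divisible by group_size
    w = weight.reshape(vocab, dim // group_size, group_size)
    # One scale per group, mapping values into the int4 range [-8, 7].
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q.reshape(vocab, dim), scale.squeeze(-1)

def embedding_lookup_dequant(q, scale, ids, group_size: int = 32):
    # Look up quantized rows, then rescale each group back to float.
    rows = q[ids].to(torch.float32).reshape(*ids.shape, -1, group_size)
    return (rows * scale[ids].unsqueeze(-1)).reshape(*ids.shape, -1)
```

The point is that only the 4-bit codes and a small per-group scale tensor need to be stored, which is where the pte size reduction reported below comes from.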
Force-pushed from a81ea21 to e530130
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Looks good, thank you!
Summary: Regression from pytorch#8107; it causes buck run of Python binaries to fail. pytorch#7691 then introduces a dependency in the source transformation. Reviewed By: larryliu0820 Differential Revision: D69942429
Summary: Regression from pytorch#8107; it causes buck run of Python binaries to fail. pytorch#7691 then introduces a dependency in the source transformation. Reviewed By: larryliu0820, kirklandsign Differential Revision: D69942429
Summary:
Reproduce command
Export pte with 4-bit embedding
Run
Check story llama results (pte sizes; a rough size estimate follows below):
- hybrid (kv=128, prefill=32) + 4-bit embedding: 142 MB
- hybrid (kv=128, prefill=32): 223 MB
- prefill=32 + 4-bit embedding: 70 MB
- prefill=32: 104 MB
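A back-of-the-envelope sketch of where the saving comes from; the vocabulary size, embedding dimension, and group size below are assumptions, not the actual stories-llama configuration:

```python
# Hypothetical dimensions, purely to illustrate the 4-bit saving.
vocab, dim, group = 32000, 2048, 32

fp32_bytes  = vocab * dim * 4                # fp32 embedding table
int4_bytes  = vocab * dim // 2               # two 4-bit values per byte
scale_bytes = vocab * (dim // group) * 4     # one fp32 scale per group

print(f"fp32 embedding : {fp32_bytes / 2**20:.1f} MiB")
print(f"4-bit embedding: {(int4_bytes + scale_bytes) / 2**20:.1f} MiB")
```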
llama 3.2 1B performance
- hybrid (kv=512, prefill=32, num_sharding=4) + 4-bit embedding
- hybrid (kv=512, prefill=32, num_sharding=4)