[Executorch][quant] Optimize per channel dequantize #5670

Merged: facebook-github-bot merged 11 commits into gh/kimishpatel/113/base from gh/kimishpatel/113/head on Dec 2, 2024.

kimishpatel commented Sep 25, 2024

Stack from ghstack (oldest at bottom):

When a quantized KV cache is used, the dequantization routine takes a significant amount of time.
This diff vectorizes per-channel dequantization for the common case.

Differential Revision: D63338858
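As a rough illustration of what vectorizing per-channel dequantization for a common case can mean, here is a minimal sketch. It assumes a contiguous int8 tensor viewed as [outer, channels, inner], quantized along the middle axis so that each contiguous run of `inner` elements shares one scale and zero point. The function name `dequantize_per_channel` and its signature are hypothetical and are not ExecuTorch's kernel API. The sketch relies on compiler auto-vectorization of a unit-stride inner loop rather than on whatever explicit SIMD the actual diff uses.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not ExecuTorch's kernel): dequantizes a contiguous
// int8 tensor laid out as [outer, channels, inner], quantized per channel
// along the middle axis:
//   out[o][c][i] = (q[o][c][i] - zero_point[c]) * scale[c]
void dequantize_per_channel(const std::int8_t* q,
                            const float* scale,
                            const std::int32_t* zero_point,
                            float* out,
                            std::size_t outer,
                            std::size_t channels,
                            std::size_t inner) {
  for (std::size_t o = 0; o < outer; ++o) {
    for (std::size_t c = 0; c < channels; ++c) {
      // Load the per-channel parameters once, outside the hot loop.
      const float s = scale[c];
      const float zp = static_cast<float>(zero_point[c]);
      const std::size_t base = (o * channels + c) * inner;
      // Unit-stride, branch-free loop with no per-element index math:
      // a shape compilers can typically auto-vectorize.
      for (std::size_t i = 0; i < inner; ++i) {
        out[base + i] = (static_cast<float>(q[base + i]) - zp) * s;
      }
    }
  }
}
```

Hoisting `scale[c]` and `zero_point[c]` out of the inner loop is what makes the common contiguous case cheap; layouts where the quantized axis is innermost, or non-contiguous inputs, would still need a general fallback path.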


[Executorch][quant] Optimize per channel dequantize (bdc1a33)

pytorch-bot commented Sep 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5670

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0775033 with merge base c726a9b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label

kimishpatel mentioned this pull request

[ExecuTorch] Some updated to kv cache #5663

Closed

facebook-github-bot commented Sep 25, 2024

This pull request was exported from Phabricator. Differential Revision: D63338858

kimishpatel mentioned this pull request

Fix dequantize per channel to handle double scale type #5524

Closed

facebook-github-bot added the fb-exported label

This was referenced Sep 25, 2024

[ExecuTorch] Add quantized kv cache to llama #5664

Closed

Refactor custom SDPA op to separate kv cache update from the custom sdpa op #5665

Closed

Add update_quantized_cache op #5527

Closed

[Executorch][llama] Update SDPA op to use quantized kv cache #5666

Closed

[Executorch][llama] Refactoring sdpa #5667

Closed

[Executorch] Update EXECUTORCH_LIBRARY macro #5668

Closed

[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache #5669

Closed

kimishpatel added a commit that referenced this pull request


[Executorch][quant] Optimize per channel dequantize

ghstack-source-id: 244715672
Pull Request resolved: #5670


Update on "[Executorch][quant] Optimize per channel dequantize" (c8b3e00)

kimishpatel mentioned this pull request

Dont quantize the current token for attention #5715

Merged


Update on "[Executorch][quant] Optimize per channel dequantize" (45530bb)

kimishpatel added a commit that referenced this pull request


[Executorch][quant] Optimize per channel dequantize (9c3d846)

Pull Request resolved: #5670
ghstack-source-id: 245231655
@exported-using-ghexport


Update on "[Executorch][quant] Optimize per channel dequantize" (5d799e8)

digantdesai approved these changes


Update on "[Executorch][quant] Optimize per channel dequantize" (f63c0f3)


Update on "[Executorch][quant] Optimize per channel dequantize" (e505c03)


Update on "[Executorch][quant] Optimize per channel dequantize" (917597e)


Update on "[Executorch][quant] Optimize per channel dequantize" (c1658eb)


Update on "[Executorch][quant] Optimize per channel dequantize" (cc55a2a)

This was referenced Nov 16, 2024

[Executorch][llama] Rename update_quantized_cache to update_cache #6914

Merged

[Executorch][BE] Rename sdpa_with_kv_cache.py to custom_ops.py #6996

Merged

[Executorch] Add quantized kv cache to oss ci #6997

Merged


Update on "[Executorch][quant] Optimize per channel dequantize" (e3cefc5)

kimishpatel added the topic: not user facing label

This was referenced Nov 22, 2024

[Executorch][custom ops] Change lib loading logic to account for package dir #7038

Merged

[Executorch][CI] Fix qnn runner ci job scripts #7049

Closed


Update on "[Executorch][quant] Optimize per channel dequantize"

facebook-github-bot merged commit 3046412 into gh/kimishpatel/113/base

42 checks passed

facebook-github-bot deleted the gh/kimishpatel/113/head branch on December 2, 2024 16:19

facebook-github-bot temporarily deployed to cherry-pick-bot with GitHub Actions on December 2, 2024 16:19 (inactive)

pytorchbot mentioned this pull request

[Executorch][quant] Optimize per channel dequantize #7139

Merged

kirklandsign pushed a commit that referenced this pull request


[Executorch][quant] Optimize per channel dequantize (ddec0c7)

Pull Request resolved: #5670
ghstack-source-id: 255730818
@exported-using-ghexport

Co-authored-by: Kimish Patel


Labels: CLA Signed, fb-exported, topic: not user facing