
Qualcomm AI Engine Direct - Add smart mask kv updator for llama3.2 #7694


Merged (1 commit into pytorch:main, Jan 30, 2025)

Conversation

chunit-quic (Collaborator)

  • Add flag to use smart mask or shift pointer
  • Add llama3_2 python with smart mask updator
  • Change Memory class to IoMgrBase
  • Change HybridMemory class to ShiftPointerIoMgr (a rough sketch of the resulting structure follows this list)
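To make the new structure easier to picture, here is a rough illustrative sketch in Python. The actual I/O managers live in the example's C++ runner; IoMgrBase and ShiftPointerIoMgr are the names from the bullets above, while the SmartMaskIoMgr class name, the method signature, and the "shift_pointer" flag value are assumptions made only for this sketch.

```python
# Illustrative sketch only -- not the real runner code (which is C++).
class IoMgrBase:  # formerly `Memory`
    """Owns the preallocated KV-cache I/O buffers and updates them each step."""

    def update_kv(self, new_k, new_v, pos):
        raise NotImplementedError


class ShiftPointerIoMgr(IoMgrBase):  # formerly `HybridMemory`
    def update_kv(self, new_k, new_v, pos):
        # Advance the I/O pointers so the valid KV window stays contiguous.
        ...


class SmartMaskIoMgr(IoMgrBase):  # hypothetical name for the smart-mask updator
    def update_kv(self, new_k, new_v, pos):
        # Write the new entry in place and unmask its position
        # (see the discussion later in this thread).
        ...


def make_io_mgr(kv_updator: str) -> IoMgrBase:
    # Mirrors the new flag: `--kv_updator smart_mask` selects the new path;
    # "shift_pointer" is an assumed spelling for the other choice.
    return SmartMaskIoMgr() if kv_updator == "smart_mask" else ShiftPointerIoMgr()
```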

pytorch-bot bot commented Jan 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7694

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 593b866 with merge base e00eaea:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jan 16, 2025
@chunit-quic (Collaborator, Author) commented Jan 16, 2025

Hi @cccclai,

We provide a new KV cache updating mechanism that benefits long-length settings.
For example, with the SM8650 SoC and a KV length of 4096, it generates 47 tokens/sec (compared to 40 tokens/sec previously).

python examples/qualcomm/oss_scripts/llama3_2/llama.py -b build-android -H ${HOST} -s ${DEVICE} -m "SM8650" --checkpoint ${Llama3.2-1B-Instruct}/consolidated.00.pth --params ${Llama3.2-1B-Instruct}/params.json --tokenizer_model ${Llama3.2-1B-Instruct}/tokenizer.model --prompt ${PROMPT} --ptq 16a4w --temperature 0 --model_size 1B --model_mode hybrid --prefill_seq_len 128 --kv_seq_len 4096 --kv_updator smart_mask

Following this, a documentation PR for the updators will be submitted soon.
Feel free to ask us if you have any questions. Thank you! :D

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment


I guess this PR needs to be rebased as #7618 is merged?

@chunit-quic (Collaborator, Author)

I guess this PR needs to be rebased as #7618 is merged?

Thanks for pointing it out! Just rebased.

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) commented Jan 20, 2025

Import fails... I feel like it's still not rebased successfully.

@chunit-quic (Collaborator, Author)

Import fails... I feel like it's still not rebased successfully.

I have rebased onto the latest main branch again. Since I cannot see the import error on my side, feel free to let me know if it still fails.

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment


Thanks!

@facebook-github-bot merged commit 4796da7 into pytorch:main on Jan 30, 2025
45 of 46 checks passed
@jds250 commented Feb 2, 2025

Hi @cccclai,

We provide a new KV cache updating mechanism that benefits long-length settings. For example, with the SM8650 SoC and a KV length of 4096, it generates 47 tokens/sec (compared to 40 tokens/sec previously).

python examples/qualcomm/oss_scripts/llama3_2/llama.py -b build-android -H ${HOST} -s ${DEVICE} -m "SM8650" --checkpoint ${Llama3.2-1B-Instruct}/consolidated.00.pth --params ${Llama3.2-1B-Instruct}/params.json --tokenizer_model ${Llama3.2-1B-Instruct}/tokenizer.model --prompt ${PROMPT} --ptq 16a4w --temperature 0 --model_size 1B --model_mode hybrid --prefill_seq_len 128 --kv_seq_len 4096 --kv_updator smart_mask

Following this, a documentation PR for the updators will be submitted soon. Feel free to ask us if you have any questions. Thank you! :D

Hi, I am a little confused about why the smart_mask updator is more beneficial for long-length settings. Can you help me? Thanks!

@chunit-quic (Collaborator, Author)

Hi, I am a little confused about why the smart_mask updator is more beneficial for long-length settings. Can you help me? Thanks!

Sure! The smart_mask updator is better for long-length settings because it uses shared memory, reducing data transfer between CPU and HTP.
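To make that concrete, below is a minimal conceptual sketch, not the ExecuTorch implementation; the array names, shapes, and function names are invented for illustration. It shows why the smart-mask style touches so little memory per decoded token once the buffers live in memory shared with the HTP.

```python
import numpy as np

KV_LEN, HEAD_DIM = 4096, 64

# Both strategies allocate the full-length cache and mask once, up front.
k_cache = np.zeros((KV_LEN, HEAD_DIM), dtype=np.float32)
attn_mask = np.full((KV_LEN,), -np.inf, dtype=np.float32)  # all positions masked at first


def smart_mask_update(new_k: np.ndarray, pos: int) -> None:
    """Write the new entry in place and unmask its position.

    Only one cache row and one mask value change per step, so when the buffers
    sit in memory shared with the HTP there is almost nothing to copy back and
    forth -- and the saving grows with the KV length.
    """
    k_cache[pos] = new_k
    attn_mask[pos] = 0.0


def shift_pointer_update(new_k: np.ndarray, pos: int) -> np.ndarray:
    """Conceptually, the valid window into the same preallocated buffer advances
    each step, so the runtime has to re-point the model's cache inputs at the
    shifted window every iteration."""
    k_cache[pos] = new_k
    return k_cache[: pos + 1]
```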

@jds250 commented Feb 3, 2025

Hi, I am a little confused about why the smart_mask updator is more beneficial for long-length settings. Can you help me? Thanks!

Sure! The smart_mask updator is better for long-length settings because it uses shared memory, reducing data transfer between CPU and HTP.

Thanks for your reply! Besides using shared memory, does smart_mask also reduce memory allocation overhead? It seems that the shift pointer approach needs to malloc new space for the new KV cache on every inference iteration. Also, I want to know why we can't just keep the whole KV cache on the HTP side, so we don't have the data transfer cost.

@chunit-quic (Collaborator, Author)

It seems that the shift pointer approach needs to malloc new space for the new KV cache on every inference iteration.

No, both of them allocate only once. You can check the init function implementations of these classes.

why we can't just keep the whole KV cache on the HTP side, so we don't have the data transfer cost.

That's what the shared buffer helps to achieve. :)

@jds250 commented Feb 6, 2025

It seems that the shift pointer approach needs to malloc new space for the new KV cache on every inference iteration.

No, both of them allocate only once. You can check the init function implementations of these classes.

why we can't just keep the whole KV cache on the HTP side, so we don't have the data transfer cost.

That's what the shared buffer helps to achieve. :)

Oh I see, thanks for your help. BTW, I noticed that static_llama.py implements prepare_feedfoward_conv; it seems like we convert the original Linear layers to Conv layers. Why should we do that? Is convolution more efficient than Linear on HTP?

@chunit-quic (Collaborator, Author)

Why should we do that? Is convolution more efficient than Linear on HTP?

The reason is that we have specific optimization passes for convolution operations. Since in some cases a linear layer is mathematically equivalent to a convolution, this conversion results in performance improvements.
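As a quick standalone illustration of that equivalence (not code from this PR): a Linear layer and a 1x1 Conv2d carrying the same weights produce identical outputs, which is why the graph can be rewritten in terms of convolutions that the backend's optimization passes handle well.

```python
import torch
import torch.nn as nn

in_features, out_features = 64, 128
linear = nn.Linear(in_features, out_features)

# Copy the Linear weights into an equivalent 1x1 convolution.
conv = nn.Conv2d(in_features, out_features, kernel_size=1)
conv.weight.data = linear.weight.data.view(out_features, in_features, 1, 1)
conv.bias.data = linear.bias.data

x = torch.randn(2, 16, in_features)  # (batch, seq, features)
y_linear = linear(x)

# Treat the sequence dimension as a spatial dimension (NCHW) for the conv.
y_conv = conv(x.transpose(1, 2).unsqueeze(-1)).squeeze(-1).transpose(1, 2)

print(torch.allclose(y_linear, y_conv, atol=1e-5))  # True
```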

Labels: CLA Signed, topic: not user facing
4 participants