
Qualcomm AI Engine Direct - Add smart mask kv updator for llama3.2 #7694


Merged (1 commit into pytorch:main, Jan 30, 2025)

Conversation

chunit-quic (Collaborator)

  • Add flag to use smart mask or shift pointer
  • Add llama3_2 python with smart mask updator
  • Change Memory class to IoMgrBase
  • Change HybridMemory class to ShiftPointerIoMgr (a rough sketch of the resulting structure follows this list)
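To make the new structure easier to picture, here is a rough illustrative sketch in Python. The actual I/O managers live in the example's C++ runner; IoMgrBase and ShiftPointerIoMgr are the names from the bullets above, while the SmartMaskIoMgr class name, the method signature, and the "shift_pointer" flag value are assumptions made only for this sketch.

```python
# Illustrative sketch only -- not the real runner code (which is C++).
class IoMgrBase:  # formerly `Memory`
    """Owns the preallocated KV-cache I/O buffers and updates them each step."""

    def update_kv(self, new_k, new_v, pos):
        raise NotImplementedError


class ShiftPointerIoMgr(IoMgrBase):  # formerly `HybridMemory`
    def update_kv(self, new_k, new_v, pos):
        # Advance the I/O pointers so the valid KV window stays contiguous.
        ...


class SmartMaskIoMgr(IoMgrBase):  # hypothetical name for the smart-mask updator
    def update_kv(self, new_k, new_v, pos):
        # Write the new entry in place and unmask its position
        # (see the discussion later in this thread).
        ...


def make_io_mgr(kv_updator: str) -> IoMgrBase:
    # Mirrors the new flag: `--kv_updator smart_mask` selects the new path;
    # "shift_pointer" is an assumed spelling for the other choice.
    return SmartMaskIoMgr() if kv_updator == "smart_mask" else ShiftPointerIoMgr()
```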

pytorch-bot bot commented Jan 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7694

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 593b866 with merge base e00eaea:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jan 16, 2025
@chunit-quic (Collaborator, Author) commented Jan 16, 2025

Hi @cccclai,

We provide a new KV cache updating mechanism that benefits long-length settings.
For example, with the SM8650 SoC and a KV length of 4096, it generates 47 tokens/sec (compared to 40 tokens/sec previously).

python examples/qualcomm/oss_scripts/llama3_2/llama.py -b build-android -H ${HOST} -s ${DEVICE} -m "SM8650" --checkpoint ${Llama3.2-1B-Instruct}/consolidated.00.pth --params ${Llama3.2-1B-Instruct}/params.json --tokenizer_model ${Llama3.2-1B-Instruct}/tokenizer.model --prompt ${PROMPT} --ptq 16a4w --temperature 0 --model_size 1B --model_mode hybrid --prefill_seq_len 128 --kv_seq_len 4096 --kv_updator smart_mask

Following this, a documentation PR for the updators will be submitted soon.
Feel free to ask us if you have any questions. Thank you! :D

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment


I guess this PR needs to be rebased as #7618 is merged?

@chunit-quic (Collaborator, Author)

I guess this PR needs to be rebased as #7618 is merged?

Thanks for pointing it out! Just rebased.

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) commented Jan 20, 2025

Import fails... I feel like it's still not rebased successfully.

@chunit-quic (Collaborator, Author)

Import fails... I feel like it's still not rebased successfully.

I have rebased onto the latest main branch again. Since I cannot see the import error on my side, feel free to let me know if it still fails.

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment


Thanks!

@facebook-github-bot merged commit 4796da7 into pytorch:main on Jan 30, 2025
45 of 46 checks passed
@jds250 commented Feb 2, 2025

Hi @cccclai,

We provide a new KV cache updating mechanism that benefits long-length settings. For example, with the SM8650 SoC and a KV length of 4096, it generates 47 tokens/sec (compared to 40 tokens/sec previously).

python examples/qualcomm/oss_scripts/llama3_2/llama.py -b build-android -H ${HOST} -s ${DEVICE} -m "SM8650" --checkpoint ${Llama3.2-1B-Instruct}/consolidated.00.pth --params ${Llama3.2-1B-Instruct}/params.json --tokenizer_model ${Llama3.2-1B-Instruct}/tokenizer.model --prompt ${PROMPT} --ptq 16a4w --temperature 0 --model_size 1B --model_mode hybrid --prefill_seq_len 128 --kv_seq_len 4096 --kv_updator smart_mask

Following this, a documentation PR for the updators will be submitted soon. Feel free to ask us if you have any questions. Thank you! :D

Hi, I am a little confused about why the smart_mask updator is more beneficial for long-length settings. Can you help me? Thanks!

@chunit-quic (Collaborator, Author)

Hi, I am a little confused about why the smart_mask updator is more beneficial for long-length settings. Can you help me? Thanks!

Sure! The smart_mask updator is better for long-length settings because it uses shared memory, reducing data transfer between CPU and HTP.
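To make that concrete, below is a minimal conceptual sketch, not the ExecuTorch implementation; the array names, shapes, and function names are invented for illustration. It shows why the smart-mask style touches so little memory per decoded token once the buffers live in memory shared with the HTP.

```python
import numpy as np

KV_LEN, HEAD_DIM = 4096, 64

# Both strategies allocate the full-length cache and mask once, up front.
k_cache = np.zeros((KV_LEN, HEAD_DIM), dtype=np.float32)
attn_mask = np.full((KV_LEN,), -np.inf, dtype=np.float32)  # all positions masked at first


def smart_mask_update(new_k: np.ndarray, pos: int) -> None:
    """Write the new entry in place and unmask its position.

    Only one cache row and one mask value change per step, so when the buffers
    sit in memory shared with the HTP there is almost nothing to copy back and
    forth -- and the saving grows with the KV length.
    """
    k_cache[pos] = new_k
    attn_mask[pos] = 0.0


def shift_pointer_update(new_k: np.ndarray, pos: int) -> np.ndarray:
    """Conceptually, the valid window into the same preallocated buffer advances
    each step, so the runtime has to re-point the model's cache inputs at the
    shifted window every iteration."""
    k_cache[pos] = new_k
    return k_cache[: pos + 1]
```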

@jds250 commented Feb 3, 2025

Hi, I am a little confused about why the smart_mask updator is more beneficial for long-length settings. Can you help me? Thanks!

Sure! The smart_mask updator is better for long-length settings because it uses shared memory, reducing data transfer between CPU and HTP.

Thanks for your reply! Besides using shared memory, does smart_mask also reduce memory allocation overhead? It seems that the shift pointer approach needs to malloc new space for the new KV cache on every inference iteration. Also, I want to know why we can't just keep the whole KV cache on the HTP side, so we don't have the data transfer cost.

@chunit-quic (Collaborator, Author)

It seems that the shift pointer approach needs to malloc new space for the new KV cache on every inference iteration.

No, both of them allocate only once. You can check the init function implementations of these classes.

why we can't just keep the whole KV cache on the HTP side, so we don't have the data transfer cost.

That's what the shared buffer helps to achieve. :)

@jds250 commented Feb 6, 2025

It seems that the shift pointer approach needs to malloc new space for the new KV cache on every inference iteration.

No, both of them allocate only once. You can check the init function implementations of these classes.

why we can't just keep the whole KV cache on the HTP side, so we don't have the data transfer cost.

That's what the shared buffer helps to achieve. :)

Oh I see, thanks for your help. BTW, I noticed that static_llama.py implements prepare_feedfoward_conv; it seems like we convert the original Linear layers to Conv layers. Why should we do that? Is convolution more efficient than Linear on HTP?

@chunit-quic (Collaborator, Author)

Why should we do that? Is convolution more efficient than Linear on HTP?

The reason is that we have specific optimization passes for convolution operations. Since in some cases a linear layer is mathematically equivalent to a convolution, this conversion results in performance improvements.
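As a quick standalone illustration of that equivalence (not code from this PR): a Linear layer and a 1x1 Conv2d carrying the same weights produce identical outputs, which is why the graph can be rewritten in terms of convolutions that the backend's optimization passes handle well.

```python
import torch
import torch.nn as nn

in_features, out_features = 64, 128
linear = nn.Linear(in_features, out_features)

# Copy the Linear weights into an equivalent 1x1 convolution.
conv = nn.Conv2d(in_features, out_features, kernel_size=1)
conv.weight.data = linear.weight.data.view(out_features, in_features, 1, 1)
conv.bias.data = linear.bias.data

x = torch.randn(2, 16, in_features)  # (batch, seq, features)
y_linear = linear(x)

# Treat the sequence dimension as a spatial dimension (NCHW) for the conv.
y_conv = conv(x.transpose(1, 2).unsqueeze(-1)).squeeze(-1).transpose(1, 2)

print(torch.allclose(y_linear, y_conv, atol=1e-5))  # True
```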

Labels: CLA Signed, topic: not user facing
4 participants