Qualcomm AI Engine Direct - Add smart mask kv updator for llama3.2 #7694
Conversation
chunit-quic commented on Jan 16, 2025
- Add flag to use smart mask or shift pointer
- Add llama3_2 python with smart mask updator
- Change Memory class to IoMgrBase
- Change HybridMemory class to ShiftPointerIoMgr
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7694
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure. As of commit 593b866 with merge base e00eaea, the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai, We provide a new KV cache updating mechanism that benefits long-sequence settings.

python examples/qualcomm/oss_scripts/llama3_2/llama.py -b build-android -H ${HOST} -s ${DEVICE} -m "SM8650" --checkpoint ${Llama3.2-1B-Instruct}/consolidated.00.pth --params ${Llama3.2-1B-Instruct}/params.json --tokenizer_model ${Llama3.2-1B-Instruct}/tokenizer.model --prompt ${PROMPT} --ptq 16a4w --temperature 0 --model_size 1B --model_mode hybrid --prefill_seq_len 128 --kv_seq_len 4096 --kv_updator smart_mask

Following this, a documentation PR for the updators will be submitted shortly.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I guess this PR needs to be rebased as #7618 is merged?
Force-pushed from 574d835 to 4aee125
Thanks for pointing out! Just rebased.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Import failed... I feel like it's still not rebased successfully.
Force-pushed from 4aee125 to 593b866
I have rebased onto the latest main branch again. Since I cannot check the import error on my side, feel free to let me know if it still fails.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks!
Hi, I am a little confused about why the smart_mask updator is more beneficial for long-sequence settings. Could you help me understand? Thanks!
Sure! The smart_mask updator is better for long-sequence settings because it uses shared memory, which reduces data transfer between the CPU and the HTP.
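For intuition, here is a minimal numpy sketch of the smart-mask idea (illustrative only, not the actual ExecuTorch implementation; names and shapes are assumptions): the full-length KV cache and attention mask are preallocated once in a buffer that both the CPU and HTP can see, and each decode step only writes the new entry in place and unmasks its position.

```python
import numpy as np

# Illustrative shapes; the real cache is per-layer and per-head.
kv_seq_len, head_dim = 4096, 64

# Preallocated once; in the real flow this lives in a shared buffer,
# so the HTP reads it directly and nothing is copied per step.
k_cache = np.zeros((kv_seq_len, head_dim), dtype=np.float32)
v_cache = np.zeros((kv_seq_len, head_dim), dtype=np.float32)
# Start fully masked: a large negative bias hides unwritten positions.
attn_mask = np.full(kv_seq_len, -65504.0, dtype=np.float32)

def smart_mask_update(pos, new_k, new_v):
    """Write this step's K/V in place and make the position visible."""
    k_cache[pos] = new_k
    v_cache[pos] = new_v
    attn_mask[pos] = 0.0  # unmask: attention can now see this token
```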
Thanks for your reply! Besides using shared memory, does smart_mask also reduce memory-allocation overhead? It seems that the shift-pointer updator needs to malloc new space for the new KV cache at every inference iteration. Also, I want to know why we can't just keep the whole KV cache on the HTP side, so we don't have the data transfer cost.
No, both of them allocate only once. You can check the init function implementations of the classes.
That's what the shared buffer helps to achieve. :)
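For contrast, a similarly hypothetical sketch of the shift-pointer idea, reusing the buffers from the sketch above: the allocation also happens once up front, and each step appends in place while only an offset into the buffer advances.

```python
def shift_pointer_update(offset, new_k, new_v):
    """Append this step's K/V and advance the read offset by one.

    Only the offset changes; entries already written never move.
    Without a shared buffer, the cache view still has to be shipped
    to the HTP each step, which is the cost smart_mask avoids.
    """
    k_cache[offset] = new_k
    v_cache[offset] = new_v
    # The model would read k_cache[:offset + 1] starting at a shifted
    # base pointer instead of consulting a mask.
    return offset + 1
```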
Oh I see, thanks for your help. BTW, I found that static_llama.py implements prepare_feedfoward_conv, which seems to convert the original Linear layers to Conv layers. Why do we do that? Is convolution more efficient than Linear on HTP?
The reason is that we have specific optimization passes for convolution operations. Since in some cases a linear layer is mathematically equivalent to a convolution, this conversion results in performance improvements.
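For intuition, here is a plain-PyTorch check of the equivalence being exploited (illustrative only, not the actual prepare_feedfoward_conv pass; the dimensions are made up): a Linear layer and a 1x1 Conv2d sharing the same weights produce the same output.

```python
import torch

in_features, out_features = 64, 128
linear = torch.nn.Linear(in_features, out_features)
conv = torch.nn.Conv2d(in_features, out_features, kernel_size=1)

# Reuse the linear weights: (out, in) -> (out, in, 1, 1).
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(out_features, in_features, 1, 1))
    conv.bias.copy_(linear.bias)

x = torch.randn(1, 10, in_features)              # (batch, seq, features)
y_linear = linear(x)
# Treat features as channels and sequence as a spatial dim: (N, C, H, W).
y_conv = conv(x.transpose(1, 2).unsqueeze(-1)).squeeze(-1).transpose(1, 2)
print(torch.allclose(y_linear, y_conv, atol=1e-5))  # True
```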