Qualcomm AI Engine Direct - [DO NOT MERGE] PTE size and Inference Speed Verification #7569


Conversation

winskuo-quic (Collaborator) commented Jan 9, 2025

Summary

This is a draft to verify the following for hybrid-mode models:

  • PTE Size: With weight sharing and the deduplicated delegate cache, the PTE size is reduced to 1.1 GB (see the sketch below the screenshots).
  • Inference Speed: Hybrid mode achieves ~60 tok/sec on SM8650 with QNN 2.28.

[Screenshots: PTE file size and measured inference speed]
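
For context, here is a minimal, hypothetical Python sketch of the "deduplicate delegate cache" idea mentioned in the summary (this is not the actual ExecuTorch implementation): identical processed context blobs are serialized once and referenced by index from every delegate that produced them, which is what lets the prefill and kv graphs share a single copy of the weights inside the 1.1 GB PTE.

import hashlib

def dedup_delegate_blobs(blobs):
    # blobs: one processed (compiled) blob per delegate, as bytes.
    # Returns the unique blobs plus, for each delegate, the index of the
    # blob it should load, so duplicate blobs are stored only once.
    unique, index_of, indices = [], {}, []
    for blob in blobs:
        key = hashlib.sha256(blob).hexdigest()
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(blob)
        indices.append(index_of[key])
    return unique, indices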


pytorch-bot bot commented Jan 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7569

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit f228d74 with merge base e00eaea:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 9, 2025

github-actions bot commented Jan 9, 2025

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

winskuo-quic (Collaborator, Author)

Hi @cccclai,
This draft PR (NOT INTENDED TO MERGE) applies the deduplicated delegate cache, the calibration patch you mentioned in #7175, and QNN 2.28 support.

For the runner, I have commented out the EOT condition so it generates all the tokens, which makes it easier for us to keep track of the inference speed.
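
As an illustration only (the actual runner is C++ in runner.cpp; this Python sketch simply assumes the stop condition is a plain comparison against the EOT id):

def generate(step_fn, prompt_tokens, kv_seq_len=512, eot_id=None, stop_on_eot=False):
    # step_fn is a hypothetical single-token decode step returning the next token id.
    tokens = list(prompt_tokens)
    while len(tokens) < kv_seq_len:
        next_tok = step_fn(tokens)
        tokens.append(next_tok)
        # This is the EOT condition the PR comments out, so every run decodes
        # up to kv_seq_len tokens and the throughput numbers stay comparable.
        if stop_on_eot and next_tok == eot_id:
            break
    return tokens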

I have also sent you the PTE I used via email.
This PTE is generated with QNN 2.28, prefill=32, kv=512, ptq=16a4w, mode=hybrid.
I have attached the inference results and PTE size in the summary section.
Below is the command you can use to execute the PTE file.

python examples/qualcomm/oss_scripts/llama3_2/llama.py \
  -b build-android -s {DEVICE} -m SM8650 \
  --checkpoint consolidated.00.pth --params params.json \
  --tokenizer_model tokenizer.model \
  --prompt "what is 1+1" --temperature 0 \
  --model_mode hybrid --prefill_seq_len 32 --kv_seq_len 512 \
  --ptq 16a4w --pre_gen_pte {PATH_TO_MY_PTE} --model_size 1B

Please let me know if you cannot reproduce or run into any other issues.
Thanks

facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cccclai (Contributor) commented Jan 14, 2025

I'm getting the following perf numbers:

OP595DL1:/data/local/tmp/static_llama $ ./qnn_llama3_2_runner --model_path hybrid_llama3_2_qnn.pte  --tokenizer_path tokenizer.model  --prompt "what is the capital of the united states" --eval_mode 2 --output_path output.txt

I 00:00:00.000719 executorch:runner.cpp:53] creating module: model_path=hybrid_llama3_2_qnn.pte
I 00:00:00.000797 executorch:runner.cpp:55] creating runner: tokenizer_path=tokenizer.model
I 00:00:00.000803 executorch:runner.cpp:56] eval mode=2
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 2
[WARNING] [Qnn ExecuTorch]:  <W> Initializing HtpProvider

[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[INFO] [Qnn ExecuTorch]: QnnContextCustomProtocol expected magic number: 0x5678abcd but get: 0x2000000
[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

[INFO] [Qnn ExecuTorch]: Running level=1 optimization.
[INFO] [Qnn ExecuTorch]: Running level=1 optimization.
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: Use cached delegate handle for current method: kv_forward
I 00:00:00.497866 executorch:runner.cpp:135] creating io_memory
PyTorchObserver {"prompt_tokens":17,"generated_tokens":110,"model_load_start_ms":1736879427102,"model_load_end_ms":1736879428138,"inference_start_ms":1736879428138,"inference_end_ms":1736879430634,"prompt_eval_end_ms":1736879428188,"first_token_ms":1736879428188,"aggregate_sampling_time_ms":153,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:03.532760 executorch:runner.cpp:414] 	Prompt Tokens: 17    Generated Tokens: 110
I 00:00:03.532789 executorch:runner.cpp:420] 	Model Load Time:		1.036000 (seconds)
I 00:00:03.532802 executorch:runner.cpp:430] 	Total inference time:		2.496000 (seconds)		 Rate: 	44.070513 (tokens/second)
I 00:00:03.532813 executorch:runner.cpp:438] 		Prompt evaluation:	0.050000 (seconds)		 Rate: 	340.000000 (tokens/second)
I 00:00:03.532824 executorch:runner.cpp:449] 		Generated 110 tokens:	2.446000 (seconds)		 Rate: 	44.971382 (tokens/second)
I 00:00:03.532834 executorch:runner.cpp:457] 	Time to first generated token:	0.050000 (seconds)
I 00:00:03.532840 executorch:runner.cpp:464] 	Sampling time over 110 tokens:	0.153000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

with this commit and the .pte you shared...
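
(For reference, the generation rate in the log can be recomputed from the PyTorchObserver fields above; this is just a back-of-the-envelope check, not part of the runner.)

stats = {
    "generated_tokens": 110,
    "prompt_eval_end_ms": 1736879428188,
    "inference_end_ms": 1736879430634,
}
gen_seconds = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / 1000.0
print(round(stats["generated_tokens"] / gen_seconds, 2))  # ~44.97 tok/s, matching the log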

winskuo-quic force-pushed the dev1/winskuo/debug_llama3_2_speed_and_size branch from ffd7e8b to a6aee94 on January 21, 2025 07:32