Qualcomm AI Engine Direct - [DO NOT MERGE] PTE size and Inference Speed Verification #7569


Conversation

winskuo-quic (Collaborator) commented Jan 9, 2025

Summary

This is a draft to verify the following for hybrid-mode models:

  • PTE Size: With weight sharing and the deduplicated delegate cache, the PTE size is reduced to 1.1 GB (see the sketch below the screenshots).
  • Inference Speed: Hybrid mode achieves ~60 tok/sec on SM8650 with QNN 2.28.

[Screenshots: PTE file size and measured inference speed]
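
For context, here is a minimal, hypothetical Python sketch of the "deduplicate delegate cache" idea mentioned in the summary (this is not the actual ExecuTorch implementation): identical processed context blobs are serialized once and referenced by index from every delegate that produced them, which is what lets the prefill and kv graphs share a single copy of the weights inside the 1.1 GB PTE.

import hashlib

def dedup_delegate_blobs(blobs):
    # blobs: one processed (compiled) blob per delegate, as bytes.
    # Returns the unique blobs plus, for each delegate, the index of the
    # blob it should load, so duplicate blobs are stored only once.
    unique, index_of, indices = [], {}, []
    for blob in blobs:
        key = hashlib.sha256(blob).hexdigest()
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(blob)
        indices.append(index_of[key])
    return unique, indices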


pytorch-bot bot commented Jan 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7569

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit f228d74 with merge base e00eaea:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 9, 2025

github-actions bot commented Jan 9, 2025

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

winskuo-quic (Collaborator, Author)

Hi @cccclai,
This draft PR (NOT INTENDED TO MERGE) applies the deduplicated delegate cache, the calibration patch you mentioned in #7175, and QNN 2.28 support.

For the runner, I have commented out the EOT condition so it generates all the tokens, which makes it easier for us to keep track of the inference speed.
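
As an illustration only (the actual runner is C++ in runner.cpp; this Python sketch simply assumes the stop condition is a plain comparison against the EOT id):

def generate(step_fn, prompt_tokens, kv_seq_len=512, eot_id=None, stop_on_eot=False):
    # step_fn is a hypothetical single-token decode step returning the next token id.
    tokens = list(prompt_tokens)
    while len(tokens) < kv_seq_len:
        next_tok = step_fn(tokens)
        tokens.append(next_tok)
        # This is the EOT condition the PR comments out, so every run decodes
        # up to kv_seq_len tokens and the throughput numbers stay comparable.
        if stop_on_eot and next_tok == eot_id:
            break
    return tokens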

I have also sent you the PTE I used via email.
This PTE is generated with QNN 2.28, prefill=32, kv=512, ptq=16a4w, mode=hybrid.
I have attached the inference results and PTE size in the summary section.
Below is the command you can use to execute the PTE file.

python examples/qualcomm/oss_scripts/llama3_2/llama.py \
  -b build-android -s {DEVICE} -m SM8650 \
  --checkpoint consolidated.00.pth --params params.json \
  --tokenizer_model tokenizer.model \
  --prompt "what is 1+1" --temperature 0 \
  --model_mode hybrid --prefill_seq_len 32 --kv_seq_len 512 \
  --ptq 16a4w --pre_gen_pte {PATH_TO_MY_PTE} --model_size 1B

Please let me know if you cannot reproduce or run into any other issues.
Thanks

facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cccclai (Contributor) commented Jan 14, 2025

I'm getting the following perf numbers:

OP595DL1:/data/local/tmp/static_llama $ ./qnn_llama3_2_runner --model_path hybrid_llama3_2_qnn.pte  --tokenizer_path tokenizer.model  --prompt "what is the capital of the united states" --eval_mode 2 --output_path output.txt

I 00:00:00.000719 executorch:runner.cpp:53] creating module: model_path=hybrid_llama3_2_qnn.pte
I 00:00:00.000797 executorch:runner.cpp:55] creating runner: tokenizer_path=tokenizer.model
I 00:00:00.000803 executorch:runner.cpp:56] eval mode=2
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 2
[WARNING] [Qnn ExecuTorch]:  <W> Initializing HtpProvider

[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[INFO] [Qnn ExecuTorch]: QnnContextCustomProtocol expected magic number: 0x5678abcd but get: 0x2000000
[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

[INFO] [Qnn ExecuTorch]: Running level=1 optimization.
[INFO] [Qnn ExecuTorch]: Running level=1 optimization.
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: Use cached delegate handle for current method: kv_forward
I 00:00:00.497866 executorch:runner.cpp:135] creating io_memory
PyTorchObserver {"prompt_tokens":17,"generated_tokens":110,"model_load_start_ms":1736879427102,"model_load_end_ms":1736879428138,"inference_start_ms":1736879428138,"inference_end_ms":1736879430634,"prompt_eval_end_ms":1736879428188,"first_token_ms":1736879428188,"aggregate_sampling_time_ms":153,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:03.532760 executorch:runner.cpp:414] 	Prompt Tokens: 17    Generated Tokens: 110
I 00:00:03.532789 executorch:runner.cpp:420] 	Model Load Time:		1.036000 (seconds)
I 00:00:03.532802 executorch:runner.cpp:430] 	Total inference time:		2.496000 (seconds)		 Rate: 	44.070513 (tokens/second)
I 00:00:03.532813 executorch:runner.cpp:438] 		Prompt evaluation:	0.050000 (seconds)		 Rate: 	340.000000 (tokens/second)
I 00:00:03.532824 executorch:runner.cpp:449] 		Generated 110 tokens:	2.446000 (seconds)		 Rate: 	44.971382 (tokens/second)
I 00:00:03.532834 executorch:runner.cpp:457] 	Time to first generated token:	0.050000 (seconds)
I 00:00:03.532840 executorch:runner.cpp:464] 	Sampling time over 110 tokens:	0.153000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]:  <W> Function not called, PrepareLib isn't loaded!

with this commit and the .pte you shared...
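
(For reference, the generation rate in the log can be recomputed from the PyTorchObserver fields above; this is just a back-of-the-envelope check, not part of the runner.)

stats = {
    "generated_tokens": 110,
    "prompt_eval_end_ms": 1736879428188,
    "inference_end_ms": 1736879430634,
}
gen_seconds = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / 1000.0
print(round(stats["generated_tokens"] / gen_seconds, 2))  # ~44.97 tok/s, matching the log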

winskuo-quic force-pushed the dev1/winskuo/debug_llama3_2_speed_and_size branch from ffd7e8b to a6aee94 on January 21, 2025 07:32