
Qualcomm AI Engine Direct - Add 4-bit Embedding Quantization Option #7691


Merged: 2 commits into pytorch:main, Feb 3, 2025

Conversation

shewu-quic (Collaborator)

Summary:

  • Introduce 4-bit embedding quantization for prefill, kv, and hybrid modes (see the quantization sketch after this list)
  • Fix an assertion condition bug in the annotate_and_quant_scalar pass
  • Refactor passes in capture_program
  • Add topological sorting for passes in capture_program
  • Refactor the pte export flow in hybrid mode
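Here `--embedding-quantize 4,32` requests 4-bit quantization of the embedding table with a group size of 32. As a rough illustration of what group-wise symmetric 4-bit quantization does, a minimal PyTorch sketch (illustrative only; the function names are hypothetical, not the actual ExecuTorch implementation):

```python
import torch

def quantize_embedding_4bit(weight: torch.Tensor, group_size: int = 32):
    # Hypothetical sketch: symmetric per-group 4-bit quantization of an
    # embedding table of shape (vocab_size, embedding_dim).
    vocab_size, dim = weight.shape
    assert dim % group_size == 0, "dim must be divisible by group_size"
    # Split each row into groups of `group_size` values.
    grouped = weight.reshape(vocab_size, dim // group_size, group_size)
    # One scale per group; the signed 4-bit range is [-8, 7].
    scales = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / 7.0
    q = torch.clamp(torch.round(grouped / scales), -8, 7).to(torch.int8)
    return q.reshape(vocab_size, dim), scales.squeeze(-1)

def dequantize_embedding_4bit(q: torch.Tensor, scales: torch.Tensor,
                              group_size: int = 32) -> torch.Tensor:
    vocab_size, dim = q.shape
    grouped = q.reshape(vocab_size, dim // group_size, group_size).float()
    return (grouped * scales.unsqueeze(-1)).reshape(vocab_size, dim)
```

In practice two 4-bit codes would be packed per byte (the sketch keeps one code per int8 for clarity), which is consistent with the pte size reductions reported below.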

Reproduce command

Export pte with 4-bit embedding

python examples/qualcomm/oss_scripts/llama3_2/llama.py -a ../artifacts/ -b build-android -H ${HOST} -s ${SERIAL} -m SM8650 --checkpoint ${checkpoint} --params ${param} --tokenizer_model tokenizer.model --prompt $'Could you tell me about Facebook?' --temperature 0 --model_size 1B --model_mode hybrid --prefill_seq_len 32 --kv_seq_len 512 --ptq 16a4w --compile_only --embedding-quantize 4,32 --num_sharding 4

Run

python examples/qualcomm/oss_scripts/llama3_2/llama.py  -a ../artifacts/ -b build-android -H ${HOST} -s ${SERIAL}  -m SM8650 --checkpoint ${checkpoint} --params ${param} --tokenizer_model ../tokenizer.model --temperature 0 --model_size 1B --model_mode hybrid  --ptq 16a4w --prefill_seq_len 32 --kv_seq_len 512 --pre_gen_pte  ${pre_gen_pte}  --prompt "Could you tell me about Facebook?" --embedding-quantize 4,32 --num_sharding 4

Check story llama results

hybrid (kv=128, prefill=32) + 4-bit embedding (pte size: 142MB)

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog barked and growled at her. Lily was scared and didn't know what to do.
Suddenly, a kind man came by and said, "Don't worry, I'll help you." He picked up Lily and carried her away from the dog. The man was very strong and brave.
After they were safe, the man said, "You were very brave to run away

hybrid (kv=128, prefill=32) (pte size: 223MB)

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog barked and growled at her. Lily was scared and didn't know what to do.
Suddenly, a kind man came by and said, "Don't worry, I'll help you." He picked up Lily and carried her away from the dog. The man was very strong and brave.
After they were safe, the man said, "You were very brave to run away

prefill=32 + 4-bit embedding (pte size: 70MB)

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a

prefill=32 (pte size: 104MB)

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a

llama 3.2 1B performance

hybrid (kv=512, prefill=32, num_sharding=4) + 4-bit embedding

I 00:00:01.529260 executorch:runner.cpp:456] 	Prompt Tokens: 17    Generated Tokens: 28
I 00:00:01.529274 executorch:runner.cpp:462] 	Model Load Time:		1.062000 (seconds)
I 00:00:01.529286 executorch:runner.cpp:472] 	Total inference time:		0.464000 (seconds)		 Rate: 	60.344828 (tokens/second)
I 00:00:01.529295 executorch:runner.cpp:480] 		Prompt evaluation:	0.042000 (seconds)		 Rate: 	404.761905 (tokens/second)
I 00:00:01.529303 executorch:runner.cpp:491] 		Generated 28 tokens:	0.422000 (seconds)		 Rate: 	66.350711 (tokens/second)
I 00:00:01.529313 executorch:runner.cpp:499] 	Time to first generated token:	0.042000 (seconds)
I 00:00:01.529321 executorch:runner.cpp:506] 	Sampling time over 28 tokens:	0.026000 (seconds)

hybrid (kv=512, prefill=32, num_sharding=4)

I 00:00:01.798075 executorch:runner.cpp:456] 	Prompt Tokens: 17    Generated Tokens: 39
I 00:00:01.798088 executorch:runner.cpp:462] 	Model Load Time:		1.171000 (seconds)
I 00:00:01.798098 executorch:runner.cpp:472] 	Total inference time:		0.624000 (seconds)		 Rate: 	62.500000 (tokens/second)
I 00:00:01.798107 executorch:runner.cpp:480] 		Prompt evaluation:	0.040000 (seconds)		 Rate: 	425.000000 (tokens/second)
I 00:00:01.798115 executorch:runner.cpp:491] 		Generated 39 tokens:	0.584000 (seconds)		 Rate: 	66.780822 (tokens/second)
I 00:00:01.798123 executorch:runner.cpp:499] 	Time to first generated token:	0.040000 (seconds)
I 00:00:01.798130 executorch:runner.cpp:506] 	Sampling time over 39 tokens:	0.029000 (seconds)


pytorch-bot bot commented Jan 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7691

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit e530130 with merge base a5c7609:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 16, 2025
@shewu-quic (Collaborator, Author)

Hi @cccclai ,

This PR supports 4-bit embedding on CPU for static llama and refactors the passes in capture_program.
Could you please take a look?

Thanks

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment


Looks good - probably need to rebase because #7618 is merged.

return passes


def _topological_sort_passes(passes: OrderedDict):
Contributor:

Did you run into any issue because the pass order isn't correct?

@shewu-quic (Collaborator, Author) replied on Jan 20, 2025:

No. This function prevents users from unexpectedly modifying the order of passes in scripts. I think it also keeps things flexible for users who want to introduce their own pass implementations. A sketch of the idea follows below.
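To illustrate the idea, here is a minimal sketch of dependency-aware pass ordering using Kahn's algorithm; the `depends_on` map and the function signature are hypothetical, not the actual capture_program implementation:

```python
from collections import OrderedDict, deque

def topological_sort_passes(passes: OrderedDict, depends_on: dict) -> OrderedDict:
    # Return the passes reordered so that every pass runs after all passes
    # it depends on; raise if the dependency graph contains a cycle.
    in_degree = {name: 0 for name in passes}
    dependents = {name: [] for name in passes}
    for name, deps in depends_on.items():
        for dep in deps:
            in_degree[name] += 1
            dependents[dep].append(name)
    # Seed the queue with passes that have no unmet dependencies,
    # preserving the original insertion order for determinism.
    queue = deque(n for n in passes if in_degree[n] == 0)
    ordered = OrderedDict()
    while queue:
        name = queue.popleft()
        ordered[name] = passes[name]
        for dep in dependents[name]:
            in_degree[dep] -= 1
            if in_degree[dep] == 0:
                queue.append(dep)
    if len(ordered) != len(passes):
        raise RuntimeError("Cycle detected in pass dependencies")
    return ordered
```

Sorting from an explicit dependency map means a user can register a custom pass anywhere in the dict and still get a valid execution order, rather than relying on the insertion order of the script.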

@shewu-quic force-pushed the dev1/hutton/enable_4bit_emb_cpu branch from aec4849 to 21c2250 on January 20, 2025 04:59
@shewu-quic (Collaborator, Author)

Hi @cccclai,
Thanks for the review! Just rebased.

@shewu-quic force-pushed the dev1/hutton/enable_4bit_emb_cpu branch 2 times, most recently from 7caa0d0 to a81ea21, on January 21, 2025 01:40
@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor)

cccclai commented Jan 30, 2025

It needs a rebase again, given that #7618 just relanded and #7694 has also landed.

Summary:
 - Introduce 4-bit embedding quantization for prefill, kv, and hybrid mode
 - Fix an assertion condition bug in the annotate_and_quant_scalar pass
 - Refactor passes in capture_program
 - Add topological sorting for passes in capture_program
@shewu-quic force-pushed the dev1/hutton/enable_4bit_emb_cpu branch from a81ea21 to e530130 on February 3, 2025 05:26
@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment


Looks good, thank you!

@cccclai cccclai merged commit 1d43d91 into pytorch:main Feb 3, 2025
44 of 46 checks passed
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Feb 21, 2025
Summary: Regression from pytorch#8107; it causes `buck run` Python binaries to fail. Then pytorch#7691 introduces a dependency in source transformation.

Reviewed By: larryliu0820

Differential Revision: D69942429
@cccclai cccclai mentioned this pull request Feb 21, 2025
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Feb 21, 2025
Summary:

Regression from pytorch#8107; it causes `buck run` Python binaries to fail. Then pytorch#7691 introduces a dependency in source transformation.

Reviewed By: larryliu0820, kirklandsign

Differential Revision: D69942429
Labels: CLA Signed, topic: not user facing