
Qualcomm AI Engine Direct - Add 4-bit Embedding Quantization Option #7691


Merged: 2 commits into pytorch:main, Feb 3, 2025

Conversation

shewu-quic (Collaborator)

Summary:

  • Introduce 4-bit embedding quantization for prefill, kv, and hybrid modes (see the quantization sketch after this list)
  • Fix an assertion condition bug in the annotate_and_quant_scalar pass
  • Refactor passes in capture_program
  • Add topological sorting for passes in capture_program
  • Refactor the pte export flow in hybrid mode
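Here `--embedding-quantize 4,32` requests 4-bit quantization of the embedding table with a group size of 32. As a rough illustration of what group-wise symmetric 4-bit quantization does, a minimal PyTorch sketch (illustrative only; the function names are hypothetical, not the actual ExecuTorch implementation):

```python
import torch

def quantize_embedding_4bit(weight: torch.Tensor, group_size: int = 32):
    # Hypothetical sketch: symmetric per-group 4-bit quantization of an
    # embedding table of shape (vocab_size, embedding_dim).
    vocab_size, dim = weight.shape
    assert dim % group_size == 0, "dim must be divisible by group_size"
    # Split each row into groups of `group_size` values.
    grouped = weight.reshape(vocab_size, dim // group_size, group_size)
    # One scale per group; the signed 4-bit range is [-8, 7].
    scales = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / 7.0
    q = torch.clamp(torch.round(grouped / scales), -8, 7).to(torch.int8)
    return q.reshape(vocab_size, dim), scales.squeeze(-1)

def dequantize_embedding_4bit(q: torch.Tensor, scales: torch.Tensor,
                              group_size: int = 32) -> torch.Tensor:
    vocab_size, dim = q.shape
    grouped = q.reshape(vocab_size, dim // group_size, group_size).float()
    return (grouped * scales.unsqueeze(-1)).reshape(vocab_size, dim)
```

In practice two 4-bit codes would be packed per byte (the sketch keeps one code per int8 for clarity), which is consistent with the pte size reductions reported below.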

Reproduce command

Export pte with 4-bit embedding

python examples/qualcomm/oss_scripts/llama3_2/llama.py -a ../artifacts/ -b build-android -H ${HOST} -s ${SERIAL} -m SM8650 --checkpoint ${checkpoint} --params ${param} --tokenizer_model tokenizer.model --prompt $'Could you tell me about Facebook?' --temperature 0 --model_size 1B --model_mode hybrid --prefill_seq_len 32 --kv_seq_len 512 --ptq 16a4w --compile_only --embedding-quantize 4,32 --num_sharding 4

Run

python examples/qualcomm/oss_scripts/llama3_2/llama.py  -a ../artifacts/ -b build-android -H ${HOST} -s ${SERIAL}  -m SM8650 --checkpoint ${checkpoint} --params ${param} --tokenizer_model ../tokenizer.model --temperature 0 --model_size 1B --model_mode hybrid  --ptq 16a4w --prefill_seq_len 32 --kv_seq_len 512 --pre_gen_pte  ${pre_gen_pte}  --prompt "Could you tell me about Facebook?" --embedding-quantize 4,32 --num_sharding 4

Check story llama results

hybrid (kv=128, prefill=32) + 4-bit embedding (pte size: 142MB)

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog barked and growled at her. Lily was scared and didn't know what to do.
Suddenly, a kind man came by and said, "Don't worry, I'll help you." He picked up Lily and carried her away from the dog. The man was very strong and brave.
After they were safe, the man said, "You were very brave to run away

hybrid (kv=128, prefill=32) (pte size: 223MB)

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog barked and growled at her. Lily was scared and didn't know what to do.
Suddenly, a kind man came by and said, "Don't worry, I'll help you." He picked up Lily and carried her away from the dog. The man was very strong and brave.
After they were safe, the man said, "You were very brave to run away

prefill=32 + 4-bit embedding (pte size: 70MB)

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a

prefill=32 (pte size: 104MB)

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a

llama 3.2 1B performance

hybrid (kv=512, prefill=32, num_sharding=4) + 4-bit embedding

I 00:00:01.529260 executorch:runner.cpp:456] 	Prompt Tokens: 17    Generated Tokens: 28
I 00:00:01.529274 executorch:runner.cpp:462] 	Model Load Time:		1.062000 (seconds)
I 00:00:01.529286 executorch:runner.cpp:472] 	Total inference time:		0.464000 (seconds)		 Rate: 	60.344828 (tokens/second)
I 00:00:01.529295 executorch:runner.cpp:480] 		Prompt evaluation:	0.042000 (seconds)		 Rate: 	404.761905 (tokens/second)
I 00:00:01.529303 executorch:runner.cpp:491] 		Generated 28 tokens:	0.422000 (seconds)		 Rate: 	66.350711 (tokens/second)
I 00:00:01.529313 executorch:runner.cpp:499] 	Time to first generated token:	0.042000 (seconds)
I 00:00:01.529321 executorch:runner.cpp:506] 	Sampling time over 28 tokens:	0.026000 (seconds)

hybrid (kv=512, prefill=32, num_sharding=4)

I 00:00:01.798075 executorch:runner.cpp:456] 	Prompt Tokens: 17    Generated Tokens: 39
I 00:00:01.798088 executorch:runner.cpp:462] 	Model Load Time:		1.171000 (seconds)
I 00:00:01.798098 executorch:runner.cpp:472] 	Total inference time:		0.624000 (seconds)		 Rate: 	62.500000 (tokens/second)
I 00:00:01.798107 executorch:runner.cpp:480] 		Prompt evaluation:	0.040000 (seconds)		 Rate: 	425.000000 (tokens/second)
I 00:00:01.798115 executorch:runner.cpp:491] 		Generated 39 tokens:	0.584000 (seconds)		 Rate: 	66.780822 (tokens/second)
I 00:00:01.798123 executorch:runner.cpp:499] 	Time to first generated token:	0.040000 (seconds)
I 00:00:01.798130 executorch:runner.cpp:506] 	Sampling time over 39 tokens:	0.029000 (seconds)


pytorch-bot bot commented Jan 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7691

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit e530130 with merge base a5c7609:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 16, 2025
@shewu-quic (Collaborator, Author)

Hi @cccclai ,

This PR supports 4-bit embedding on CPU for static llama and refactors the passes in capture_program.
Could you please take a look?

Thanks

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment


Looks good - probably need to rebase because #7618 is merged.

return passes


def _topological_sort_passes(passes: OrderedDict):
Contributor:

Did you run into any issue because the pass order isn't correct?

@shewu-quic (Collaborator, Author) replied on Jan 20, 2025:

No. This function prevents users from unexpectedly modifying the order of passes in scripts. I think it also keeps things flexible for users who want to introduce their own pass implementations. A sketch of the idea follows below.
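To illustrate the idea, here is a minimal sketch of dependency-aware pass ordering using Kahn's algorithm; the `depends_on` map and the function signature are hypothetical, not the actual capture_program implementation:

```python
from collections import OrderedDict, deque

def topological_sort_passes(passes: OrderedDict, depends_on: dict) -> OrderedDict:
    # Return the passes reordered so that every pass runs after all passes
    # it depends on; raise if the dependency graph contains a cycle.
    in_degree = {name: 0 for name in passes}
    dependents = {name: [] for name in passes}
    for name, deps in depends_on.items():
        for dep in deps:
            in_degree[name] += 1
            dependents[dep].append(name)
    # Seed the queue with passes that have no unmet dependencies,
    # preserving the original insertion order for determinism.
    queue = deque(n for n in passes if in_degree[n] == 0)
    ordered = OrderedDict()
    while queue:
        name = queue.popleft()
        ordered[name] = passes[name]
        for dep in dependents[name]:
            in_degree[dep] -= 1
            if in_degree[dep] == 0:
                queue.append(dep)
    if len(ordered) != len(passes):
        raise RuntimeError("Cycle detected in pass dependencies")
    return ordered
```

Sorting from an explicit dependency map means a user can register a custom pass anywhere in the dict and still get a valid execution order, rather than relying on the insertion order of the script.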

@shewu-quic force-pushed the dev1/hutton/enable_4bit_emb_cpu branch from aec4849 to 21c2250 on January 20, 2025 04:59
@shewu-quic (Collaborator, Author)

Hi @cccclai,
Thanks for the review! Just rebased.

@shewu-quic force-pushed the dev1/hutton/enable_4bit_emb_cpu branch 2 times, most recently from 7caa0d0 to a81ea21, on January 21, 2025 01:40
@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor)

cccclai commented Jan 30, 2025

It needs a rebase again, given that #7618 just relanded and #7694 has also landed.

Summary:
 - Introduce 4-bit embedding quantization for prefill, kv, and hybrid mode
 - Fix an assertion condition bug in the annotate_and_quant_scalar pass
 - Refactor passes in capture_program
 - Add topological sorting for passes in capture_program
@shewu-quic force-pushed the dev1/hutton/enable_4bit_emb_cpu branch from a81ea21 to e530130 on February 3, 2025 05:26
@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) left a comment


Looks good, thank you!

@cccclai cccclai merged commit 1d43d91 into pytorch:main Feb 3, 2025
44 of 46 checks passed
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Feb 21, 2025
Summary: Regression from pytorch#8107; it causes `buck run` Python binaries to fail. Then pytorch#7691 introduces a dependency in source transformation.

Reviewed By: larryliu0820

Differential Revision: D69942429
@cccclai cccclai mentioned this pull request Feb 21, 2025
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Feb 21, 2025
Summary:

Regression from pytorch#8107; it causes `buck run` Python binaries to fail. Then pytorch#7691 introduces a dependency in source transformation.

Reviewed By: larryliu0820, kirklandsign

Differential Revision: D69942429
Labels: CLA Signed, topic: not user facing