Qualcomm AI Engine Direct - Enable AR-N model for prompt processing in hybrid mode #8210

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged: 4 commits, Feb 25, 2025

Conversation

@shewu-quic (Collaborator) commented on Feb 5, 2025

Summary:

  • Add --max_seq_len to specify the maximum number of tokens the model can process and attend to at once when generating predictions/responses.
  • Add --prefill_ar_n to determine the number of tokens to consume and the number of logits to produce for the prompt processor in hybrid mode.
  • Remove prefill mode.
  • The best AR-N depends on model size, use case, hardware capabilities, etc.
    • We suggest profiling on-target latency and picking the best configuration from those measurements (see the sketch after this list).
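To make the AR-N flow concrete, here is a minimal sketch of how a fixed-AR-length prompt processor could chunk a prompt. This is not code from this PR: `process_prompt` and `run_ar_n_graph` are illustrative names, and the padding/masking detail is an assumption about how a partial final chunk would be handled.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the compiled AR-N graph invocation: in a real
// runner this would enqueue one accelerator execution over `len` valid tokens.
static void run_ar_n_graph(const int32_t* /*tokens*/, int32_t len) {
  std::printf("AR-N chunk with %d valid tokens\n", len);
}

// Illustrative sketch: an AR-N prompt processor consumes the prompt in fixed
// chunks of `prefill_ar_len` tokens, so one compiled graph serves any prompt
// length up to max_seq_len.
void process_prompt(const std::vector<int32_t>& prompt_tokens,
                    int32_t prefill_ar_len) {
  const int32_t num_prompt = static_cast<int32_t>(prompt_tokens.size());
  // ceil(num_prompt / prefill_ar_len) iterations; the runner log below
  // reports exactly this accounting: "total 5 tokens (AR-8 * 1 iters)".
  const int32_t iters = (num_prompt + prefill_ar_len - 1) / prefill_ar_len;
  for (int32_t i = 0; i < iters; ++i) {
    const int32_t start = i * prefill_ar_len;
    const int32_t len = std::min(prefill_ar_len, num_prompt - start);
    // A partial final chunk is assumed to be padded to prefill_ar_len with
    // the padded positions masked out of attention.
    run_ar_n_graph(prompt_tokens.data() + start, len);
  }
  // The populated KV cache then seeds the AR-1 (KV cache) decode graph.
}
```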

Test Plan

  • Try to find the best AR-N with context length CL=2048 in hybrid mode

    • Based on the on-target latency table (attached as an image in the PR), the best AR-N is 256
  • Ensure accuracy for Stories Llama 16a4w, Prompt: "Once upon a time"

    • Before (current mainline), KV mode:

```
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy was a big, red ball. One day, Lily's mom asked her to help her with the laundry. Lily was happy to help and she put all the clothes in the washing machine. After the clothes were washed, Lily's mom asked her to help her hang them up to dry. Lily saw a big, black sheet hanging on the line and she wanted to help. She grabbed the sheet and tried to hang it up
```
  • After (this PR), Shift Pointer KV updater with prefill_ar_len=8 and max_seq_len=128 in hybrid mode:

```
I 00:00:00.377477 executorch:runner.cpp:346] Prompt Processor: total 5 tokens (AR-8 * 1 iters)
I 00:00:00.927398 executorch:runner.cpp:446] 	Prompt Tokens: 5    Generated Tokens: 122
I 00:00:00.927487 executorch:runner.cpp:452] 	Model Load Time:		0.375000 (seconds)
I 00:00:00.927502 executorch:runner.cpp:462] 	Total inference time:		0.550000 (seconds)		 Rate: 	221.818182 (tokens/second)
I 00:00:00.927513 executorch:runner.cpp:470] 		Prompt evaluation:	0.008000 (seconds)		 Rate: 	625.000000 (tokens/second)
I 00:00:00.927522 executorch:runner.cpp:481] 		Generated 122 tokens:	0.542000 (seconds)		 Rate: 	225.092251 (tokens/second)
I 00:00:00.927530 executorch:runner.cpp:489] 	Time to first generated token:	0.008000 (seconds)
I 00:00:00.927538 executorch:runner.cpp:496] 	Sampling time over 122 tokens:	0.025000 (seconds)

INFO:root:Results[0]:
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy was a big, red ball. One day, Lily's mom asked her to help her with the laundry. Lily was happy to help and she put all the clothes in the washing machine.
After the clothes were washed, Lily's mom asked her to help her hang them up to dry. Lily saw a big, black iron on the counter and asked her mom what it was for. Her mom explained that it was used to make clothes look
```
  • After (this PR), Smart Mask KV updater with prefill_ar_len=8 and max_seq_len=128 in hybrid mode:

```
I 00:00:00.928392 executorch:runner.cpp:446] 	Prompt Tokens: 5    Generated Tokens: 122
I 00:00:00.928457 executorch:runner.cpp:452] 	Model Load Time:		0.367000 (seconds)
I 00:00:00.928473 executorch:runner.cpp:462] 	Total inference time:		0.559000 (seconds)		 Rate: 	218.246869 (tokens/second)
I 00:00:00.928484 executorch:runner.cpp:470] 		Prompt evaluation:	0.039000 (seconds)		 Rate: 	128.205128 (tokens/second)
I 00:00:00.928493 executorch:runner.cpp:481] 		Generated 122 tokens:	0.520000 (seconds)		 Rate: 	234.615385 (tokens/second)
I 00:00:00.928501 executorch:runner.cpp:489] 	Time to first generated token:	0.039000 (seconds)
I 00:00:00.928509 executorch:runner.cpp:496] 	Sampling time over 122 tokens:	0.035000 (seconds)

INFO:root:Results[0]:
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy was a big, red ball. One day, Lily's mom asked her to help her with the laundry. Lily was happy to help and she put all the clothes in the washing machine.
After the clothes were washed, Lily's mom asked her to help her hang them up to dry. Lily saw a big, black iron on the counter and asked her mom what it was for. Her mom explained that it was used to make clothes smooth
```

Command

```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android \
  --checkpoint ~/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \
  --params ~/.llama/checkpoints/Llama3.2-1B-Instruct/params.json \
  --tokenizer_model ~/.llama/checkpoints/Llama3.2-1B-Instruct/tokenizer.model \
  --prompt $'<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
  --temperature 0 --llama_model llama3_2 --model_mode hybrid --ptq 16a4w \
  -m SM8650 -H ${HOST} -s ${DEVICE} -a ${ARTIFACTS} \
  --max_seq_len 2048 --prefill_ar_len 256 --num_sharding 4 --kv_updater shift_pointer
```

pytorch-bot commented on Feb 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8210

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7d9a14e with merge base 77589c6:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Feb 5, 2025
@shewu-quic force-pushed the dev1/hutton/enable_ARN_mode branch from fc66256 to d36b867 on February 17, 2025 02:26
@shewu-quic marked this pull request as ready for review on February 17, 2025 02:50
@shewu-quic (Collaborator, Author) commented:

Hi @cccclai, @billmguo,

This PR enables the AR-N model for prompt processing in hybrid mode.
Could you please take a look?

Regarding the change in lower_module_backend.py, it is intended to prevent a double deletion of the persistent buffers. I observed that the buffers (freq_cos and freq_sin) are copied to each delegate node (due to graph sharding), while the original buffers are eventually deleted. Since each copied buffer shares the same target, this would result in a double deletion.
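For readers unfamiliar with the hazard, here is a minimal sketch of the dedup-by-target guard this implies. Every name below is hypothetical, and the real change lives in lower_module_backend.py (which is Python), so this is only an illustration of the idea:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration of the double-deletion hazard described above:
// after sharding, every delegate holds a copy of persistent buffers such as
// freq_cos / freq_sin, but all copies reference the same underlying target,
// so the delete pass must run at most once per target.
void delete_persistent_buffers_once(const std::vector<std::string>& targets) {
  std::set<std::string> deleted;
  for (const auto& target : targets) {
    // insert(...).second is false when this target was already handled,
    // which is exactly the case that previously caused the double deletion.
    if (deleted.insert(target).second) {
      // erase_buffer(target);  // hypothetical deletion hook
    }
  }
}
```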

@facebook-github-bot (Contributor): @cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mergennachin requested a review from cccclai on February 18, 2025 18:41
@cccclai (Contributor) commented on Feb 18, 2025:
Awesome! Thank you for getting it to work so quickly. Can you help fix these errors?

```
executorch/examples/qualcomm/oss_scripts/llama/runner/io_manager.cpp:574:7: error: unused variable 'ptr' [-Werror,-Wunused-variable]
  574 |   IO* ptr = static_cast<IO*>(data_ptr_.get());
      |       ^~~
executorch/examples/qualcomm/oss_scripts/llama/runner/io_manager.cpp:1089:11: error: unused variable 'cache_len' [-Werror,-Wunused-variable]
 1089 |   int32_t cache_len = methods_meta[0]->input_tensor_meta(0)->sizes()[1];
      |           ^~~~~~~~~
executorch/examples/qualcomm/oss_scripts/llama/runner/io_manager.cpp:1306:7: error: unused variable 'ptr' [-Werror,-Wunused-variable]
 1306 |   IO* ptr = static_cast<IO*>(data_ptr_.get());
```
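For context, these are ordinary -Wunused-variable diagnostics promoted to errors by -Werror. A minimal sketch of the two usual fixes, where `lookup_cache_len` is a stand-in and not the real tensor-meta API:

```cpp
#include <cstdint>

// Stand-in for the real tensor-meta lookup in io_manager.cpp.
int32_t lookup_cache_len() { return 0; }

void example() {
  // Fix 1: if nothing reads the local (like the flagged `ptr` variables),
  // simply delete the declaration.
  // Fix 2: if the value should be kept around for later use, annotate it
  // (C++17) so the warning, and hence the -Werror failure, no longer fires:
  [[maybe_unused]] int32_t cache_len = lookup_cache_len();
}
```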

Review thread on the README's mode descriptions:

```diff
 KV Cache Mode: In KV Cache mode, the model takes in a single previous token and generates the next predicted token along with its KV cache. It is efficient for generating subsequent tokens after the initial prompt.

-Hybrid Mode: Hybrid mode leverages the strengths of both batch prefill and KV cache modes to optimize token generation speed. Initially, it uses prefill mode to efficiently generate the prompt's key-value (KV) cache. Then, the mode switches to KV cache mode, which excels at generating subsequent tokens.
+Hybrid Mode: Hybrid mode leverages the strengths of both the AR-N model and KV cache modes to optimize token generation speed. Initially, it uses the AR-N model to efficiently generate the prompt's key-value (KV) cache. Then, the mode switches to KV cache mode, which excels at generating subsequent tokens.
+- AR-N model: The auto-regression (AR) length determines the number of tokens to consume and the number of logits to produce. Use it to process the prompt and generate the key-value (KV) cache, which serves as a prompt processor in hybrid mode.
```
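As a quick sanity check of the AR-N accounting (a worked example, not text from the PR): a prompt of p tokens with AR length n takes ceil(p / n) prompt-processor iterations. The runner log earlier shows p = 5 with AR-8, so ceil(5 / 8) = 1 iteration, matching "total 5 tokens (AR-8 * 1 iters)"; the three unused positions in that single chunk are presumably padded and masked out.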
Contributor:

Can we add the diagram you shared as part of the readme? It's much easier to understand with it.

@shewu-quic (Collaborator, Author):

Certainly, that will not be a problem.

@cccclai (Contributor) left a comment:

Thanks! And the lint

@cccclai added the release notes: qualcomm label on Feb 19, 2025
@shewu-quic force-pushed the dev1/hutton/enable_ARN_mode branch from 2d5fa26 to 203b87b on February 19, 2025 06:15
@shewu-quic (Collaborator, Author) commented:

> Thanks! And the lint

Whoops, thanks for your effort.

@facebook-github-bot (Contributor): @cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) commented on Feb 24, 2025:

Hey, it seems like there's a merge conflict. Can you rebase?

Commit: Qualcomm AI Engine Direct - Enable AR-N model for prompt processing in hybrid mode

Summary:
- Add `max_seq_len` to refer to the maximum number of tokens that the model can process and consider at once to generate predictions/responses.
- Add `prefill_ar_n` to determine the number of tokens to consume and the number of logits to produce for the prompt processor in hybrid mode.
- Remove prefill mode
@shewu-quic force-pushed the dev1/hutton/enable_ARN_mode branch from 0722beb to 7d9a14e on February 25, 2025 02:17
@facebook-github-bot (Contributor): @cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai merged commit 9484c01 into pytorch:main on Feb 25, 2025 (49 of 50 checks passed).
Comment on lines +311 to +313:

```cpp
// If the cache length is zero, it indicates a BERT model, which does not use
// position ids or KV cache inputs.
const bool is_bert_{false};
```
Contributor:

Why waste a byte storing this separately rather than using a private method like so?

```cpp
bool is_bert() const {
  return prefill_cache_len_ == 0;
}
```

@shewu-quic (Collaborator, Author):

Thank you for mentioning that. We will include it in the upcoming PR.

Labels: CLA Signed, release notes: qualcomm