Qualcomm AI Engine Direct - Enable AR-N model for prompt processing in hybrid mode #8210

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged: 4 commits, Feb 25, 2025

Conversation

@shewu-quic (Collaborator) commented on Feb 5, 2025

Summary:

  • Add --max_seq_len to specify the maximum number of tokens the model can process and attend to at once when generating predictions/responses.
  • Add --prefill_ar_n to determine the number of tokens to consume and the number of logits to produce for the prompt processor in hybrid mode.
  • Remove prefill mode.
  • The best AR-N depends on model size, use case, hardware capabilities, etc.
    • We suggest profiling on-target latency and picking the best configuration from those measurements (see the sketch after this list).
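To make the AR-N flow concrete, here is a minimal sketch of how a fixed-AR-length prompt processor could chunk a prompt. This is not code from this PR: `process_prompt` and `run_ar_n_graph` are illustrative names, and the padding/masking detail is an assumption about how a partial final chunk would be handled.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the compiled AR-N graph invocation: in a real
// runner this would enqueue one accelerator execution over `len` valid tokens.
static void run_ar_n_graph(const int32_t* /*tokens*/, int32_t len) {
  std::printf("AR-N chunk with %d valid tokens\n", len);
}

// Illustrative sketch: an AR-N prompt processor consumes the prompt in fixed
// chunks of `prefill_ar_len` tokens, so one compiled graph serves any prompt
// length up to max_seq_len.
void process_prompt(const std::vector<int32_t>& prompt_tokens,
                    int32_t prefill_ar_len) {
  const int32_t num_prompt = static_cast<int32_t>(prompt_tokens.size());
  // ceil(num_prompt / prefill_ar_len) iterations; the runner log below
  // reports exactly this accounting: "total 5 tokens (AR-8 * 1 iters)".
  const int32_t iters = (num_prompt + prefill_ar_len - 1) / prefill_ar_len;
  for (int32_t i = 0; i < iters; ++i) {
    const int32_t start = i * prefill_ar_len;
    const int32_t len = std::min(prefill_ar_len, num_prompt - start);
    // A partial final chunk is assumed to be padded to prefill_ar_len with
    // the padded positions masked out of attention.
    run_ar_n_graph(prompt_tokens.data() + start, len);
  }
  // The populated KV cache then seeds the AR-1 (KV cache) decode graph.
}
```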

Test Plan

  • Try to find the best AR-N with context length CL=2048 in hybrid mode

    • Based on the on-target latency table (attached as an image in the PR), the best AR-N is 256
  • Ensure accuracy for Stories Llama 16a4w, Prompt: "Once upon a time"

    • Before (current mainline), KV mode:

```
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy was a big, red ball. One day, Lily's mom asked her to help her with the laundry. Lily was happy to help and she put all the clothes in the washing machine. After the clothes were washed, Lily's mom asked her to help her hang them up to dry. Lily saw a big, black sheet hanging on the line and she wanted to help. She grabbed the sheet and tried to hang it up
```
  • After (this PR), Shift Pointer KV updater with prefill_ar_len=8 and max_seq_len=128 in hybrid mode:

```
I 00:00:00.377477 executorch:runner.cpp:346] Prompt Processor: total 5 tokens (AR-8 * 1 iters)
I 00:00:00.927398 executorch:runner.cpp:446] 	Prompt Tokens: 5    Generated Tokens: 122
I 00:00:00.927487 executorch:runner.cpp:452] 	Model Load Time:		0.375000 (seconds)
I 00:00:00.927502 executorch:runner.cpp:462] 	Total inference time:		0.550000 (seconds)		 Rate: 	221.818182 (tokens/second)
I 00:00:00.927513 executorch:runner.cpp:470] 		Prompt evaluation:	0.008000 (seconds)		 Rate: 	625.000000 (tokens/second)
I 00:00:00.927522 executorch:runner.cpp:481] 		Generated 122 tokens:	0.542000 (seconds)		 Rate: 	225.092251 (tokens/second)
I 00:00:00.927530 executorch:runner.cpp:489] 	Time to first generated token:	0.008000 (seconds)
I 00:00:00.927538 executorch:runner.cpp:496] 	Sampling time over 122 tokens:	0.025000 (seconds)

INFO:root:Results[0]:
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy was a big, red ball. One day, Lily's mom asked her to help her with the laundry. Lily was happy to help and she put all the clothes in the washing machine.
After the clothes were washed, Lily's mom asked her to help her hang them up to dry. Lily saw a big, black iron on the counter and asked her mom what it was for. Her mom explained that it was used to make clothes look
```
  • After (this PR), Smart Mask KV updater with prefill_ar_len=8 and max_seq_len=128 in hybrid mode:

```
I 00:00:00.928392 executorch:runner.cpp:446] 	Prompt Tokens: 5    Generated Tokens: 122
I 00:00:00.928457 executorch:runner.cpp:452] 	Model Load Time:		0.367000 (seconds)
I 00:00:00.928473 executorch:runner.cpp:462] 	Total inference time:		0.559000 (seconds)		 Rate: 	218.246869 (tokens/second)
I 00:00:00.928484 executorch:runner.cpp:470] 		Prompt evaluation:	0.039000 (seconds)		 Rate: 	128.205128 (tokens/second)
I 00:00:00.928493 executorch:runner.cpp:481] 		Generated 122 tokens:	0.520000 (seconds)		 Rate: 	234.615385 (tokens/second)
I 00:00:00.928501 executorch:runner.cpp:489] 	Time to first generated token:	0.039000 (seconds)
I 00:00:00.928509 executorch:runner.cpp:496] 	Sampling time over 122 tokens:	0.035000 (seconds)

INFO:root:Results[0]:
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy was a big, red ball. One day, Lily's mom asked her to help her with the laundry. Lily was happy to help and she put all the clothes in the washing machine.
After the clothes were washed, Lily's mom asked her to help her hang them up to dry. Lily saw a big, black iron on the counter and asked her mom what it was for. Her mom explained that it was used to make clothes smooth
```

Command

```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android \
  --checkpoint ~/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \
  --params ~/.llama/checkpoints/Llama3.2-1B-Instruct/params.json \
  --tokenizer_model ~/.llama/checkpoints/Llama3.2-1B-Instruct/tokenizer.model \
  --prompt $'<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
  --temperature 0 --llama_model llama3_2 --model_mode hybrid --ptq 16a4w \
  -m SM8650 -H ${HOST} -s ${DEVICE} -a ${ARTIFACTS} \
  --max_seq_len 2048 --prefill_ar_len 256 --num_sharding 4 --kv_updater shift_pointer
```

pytorch-bot commented on Feb 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8210

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7d9a14e with merge base 77589c6:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Feb 5, 2025
@shewu-quic force-pushed the dev1/hutton/enable_ARN_mode branch from fc66256 to d36b867 on February 17, 2025 02:26
@shewu-quic marked this pull request as ready for review on February 17, 2025 02:50
@shewu-quic (Collaborator, Author) commented:

Hi @cccclai, @billmguo,

This PR enables the AR-N model for prompt processing in hybrid mode.
Could you please take a look?

Regarding the change in lower_module_backend.py, it is intended to prevent a double deletion of the persistent buffers. I observed that the buffers (freq_cos and freq_sin) are copied to each delegate node (due to graph sharding), while the original buffers are eventually deleted. Since each copied buffer shares the same target, this would result in a double deletion.
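For readers unfamiliar with the hazard, here is a minimal sketch of the dedup-by-target guard this implies. Every name below is hypothetical, and the real change lives in lower_module_backend.py (which is Python), so this is only an illustration of the idea:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration of the double-deletion hazard described above:
// after sharding, every delegate holds a copy of persistent buffers such as
// freq_cos / freq_sin, but all copies reference the same underlying target,
// so the delete pass must run at most once per target.
void delete_persistent_buffers_once(const std::vector<std::string>& targets) {
  std::set<std::string> deleted;
  for (const auto& target : targets) {
    // insert(...).second is false when this target was already handled,
    // which is exactly the case that previously caused the double deletion.
    if (deleted.insert(target).second) {
      // erase_buffer(target);  // hypothetical deletion hook
    }
  }
}
```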

@facebook-github-bot (Contributor): @cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mergennachin requested a review from cccclai on February 18, 2025 18:41
@cccclai (Contributor) commented on Feb 18, 2025:
Awesome! Thank you for getting it to work so quickly. Can you help fix these errors?

```
executorch/examples/qualcomm/oss_scripts/llama/runner/io_manager.cpp:574:7: error: unused variable 'ptr' [-Werror,-Wunused-variable]
  574 |   IO* ptr = static_cast<IO*>(data_ptr_.get());
      |       ^~~
executorch/examples/qualcomm/oss_scripts/llama/runner/io_manager.cpp:1089:11: error: unused variable 'cache_len' [-Werror,-Wunused-variable]
 1089 |   int32_t cache_len = methods_meta[0]->input_tensor_meta(0)->sizes()[1];
      |           ^~~~~~~~~
executorch/examples/qualcomm/oss_scripts/llama/runner/io_manager.cpp:1306:7: error: unused variable 'ptr' [-Werror,-Wunused-variable]
 1306 |   IO* ptr = static_cast<IO*>(data_ptr_.get());
```
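For context, these are ordinary -Wunused-variable diagnostics promoted to errors by -Werror. A minimal sketch of the two usual fixes, where `lookup_cache_len` is a stand-in and not the real tensor-meta API:

```cpp
#include <cstdint>

// Stand-in for the real tensor-meta lookup in io_manager.cpp.
int32_t lookup_cache_len() { return 0; }

void example() {
  // Fix 1: if nothing reads the local (like the flagged `ptr` variables),
  // simply delete the declaration.
  // Fix 2: if the value should be kept around for later use, annotate it
  // (C++17) so the warning, and hence the -Werror failure, no longer fires:
  [[maybe_unused]] int32_t cache_len = lookup_cache_len();
}
```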

Review thread on the README's mode descriptions:

```diff
 KV Cache Mode: In KV Cache mode, the model takes in a single previous token and generates the next predicted token along with its KV cache. It is efficient for generating subsequent tokens after the initial prompt.

-Hybrid Mode: Hybrid mode leverages the strengths of both batch prefill and KV cache modes to optimize token generation speed. Initially, it uses prefill mode to efficiently generate the prompt's key-value (KV) cache. Then, the mode switches to KV cache mode, which excels at generating subsequent tokens.
+Hybrid Mode: Hybrid mode leverages the strengths of both the AR-N model and KV cache modes to optimize token generation speed. Initially, it uses the AR-N model to efficiently generate the prompt's key-value (KV) cache. Then, the mode switches to KV cache mode, which excels at generating subsequent tokens.
+- AR-N model: The auto-regression (AR) length determines the number of tokens to consume and the number of logits to produce. Use it to process the prompt and generate the key-value (KV) cache, which serves as a prompt processor in hybrid mode.
```
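As a quick sanity check of the AR-N accounting (a worked example, not text from the PR): a prompt of p tokens with AR length n takes ceil(p / n) prompt-processor iterations. The runner log earlier shows p = 5 with AR-8, so ceil(5 / 8) = 1 iteration, matching "total 5 tokens (AR-8 * 1 iters)"; the three unused positions in that single chunk are presumably padded and masked out.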
Contributor:

Can we add the diagram you shared as part of the readme? It's much easier to understand with it.

@shewu-quic (Collaborator, Author):

Certainly, that will not be a problem.

@cccclai (Contributor) left a comment:

Thanks! And the lint

@cccclai added the release notes: qualcomm label on Feb 19, 2025
@shewu-quic force-pushed the dev1/hutton/enable_ARN_mode branch from 2d5fa26 to 203b87b on February 19, 2025 06:15
@shewu-quic (Collaborator, Author) commented:

> Thanks! And the lint

Whoops, thanks for your effort.

@facebook-github-bot (Contributor): @cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) commented on Feb 24, 2025:

Hey, it seems like there's a merge conflict. Can you rebase?

Commit: Qualcomm AI Engine Direct - Enable AR-N model for prompt processing in hybrid mode

Summary:
- Add `max_seq_len` to refer to the maximum number of tokens that the model can process and consider at once to generate predictions/responses.
- Add `prefill_ar_n` to determine the number of tokens to consume and the number of logits to produce for the prompt processor in hybrid mode.
- Remove prefill mode
@shewu-quic force-pushed the dev1/hutton/enable_ARN_mode branch from 0722beb to 7d9a14e on February 25, 2025 02:17
@facebook-github-bot (Contributor): @cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai merged commit 9484c01 into pytorch:main on Feb 25, 2025 (49 of 50 checks passed).
Comment on lines +311 to +313:

```cpp
// If the cache length is zero, it indicates a BERT model, which does not use
// position ids or KV cache inputs.
const bool is_bert_{false};
```
Contributor:

Why waste a byte storing this separately rather than using a private method like so?

```cpp
bool is_bert() const {
  return prefill_cache_len_ == 0;
}
```

@shewu-quic (Collaborator, Author):

Thank you for mentioning that. We will include it in the upcoming PR.

Labels: CLA Signed, release notes: qualcomm