Qualcomm AI Engine Direct - The performance issue about mutable buffer #6493
Conversation
I have a couple of questions/suggestions. For each optimization that was applied, what is the benefit in % and in terms of num tok/sec? For example, the change "Change the data type of input token from int64 to int32" seems a bit odd from the model's API perspective. Regarding the mutable buffer copy at the end: this can be avoided only under the assumption that the delegate manages and consumes the mutable buffer entirely. So if you manage your own KV cache and consume such a mutable buffer, then you can avoid this copy.
See this PR for allowing delegates to consume buffers like the KV cache: #4830
Is the copy op faster, or is this option faster?
I think that QNN_PROPERTY_TENSOR_SUPPORT_UPDATEABLE_STATIC_TENSORS is used for LoRA.
Thanks for your suggestion. Got it. I will add the percentage for each item.
BTW, I think "Change the data type of input token from int64 to int32" is the one that optimizes performance the most, from 29.x tok/sec to 48.x tok/sec. Int32 is QNN HTP friendly; it significantly speeds up the QNN embedding operations.
Does it mean we need to in-place update the KV cache? If so, I am not sure whether that exists in QNN HTP.
Yes. You need to take care of the cache update and maintain it.
Yeah, since this op is inserted inside to_executorch, how about we consume this state-update part in the QNN backend and run it there?
I don't quite understand. Could you explain a bit more? The copy ops have been inserted inside convert_pt2e. If possible, we would delegate these copy ops and not insert them again inside to_executorch.
Oh wait, can you share a bit about what the graph looks like after convert_pt2e? I thought they're still in-place ops. Is it related to the previously unexpected graph #4627, where the hack was inserted somewhere around pt2e? I thought it's ...
May I know how I could check whether they are in-place ops or not?
However, in our unit test, ExirExportedProgram is used, and its behavior is as follows:
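(For reference, a minimal sketch of one way to check this, using torch.export rather than the exact unit-test flow mentioned above; the module and shapes are made up to mirror the example later in this thread:)

```python
import torch
from torch.export import export

class Cache(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("k_cache", torch.zeros((1, 4, 12, 64)))

    def forward(self, input_pos, k_val):
        # In-place update of the registered buffer
        return torch.ops.aten.index_put_(self.k_cache, [None, input_pos], k_val)

ep = export(Cache(), (torch.tensor([2]), torch.ones((1, 1, 12, 64))))
# Export functionalizes the graph: the in-place index_put_ shows up as the
# functional index_put, and the buffer write is recorded in the graph signature.
ep.graph_module.print_readable()
print(ep.graph_signature.buffers_to_mutate)  # maps an output node back to k_cache
```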
Oops, in our unit test, it doesn't seem to work with mutable buffers...
Hi @cccclai,
That's what I expect, because if we don't copy the mutable buffer then it is functionally incorrect.
Can you manage the KV cache buffer inside the QNN delegate of ET? Meaning not in the QNN SDK but in the QNN code inside backends/qualcomm in ET. If you maintain the KV cache as a buffer in the QNN delegate runtime, then you can either do the same copy inside the QNN delegate runtime, or have the functional index_put alias to the same tensor so that the index_put implementation modifies the original buffer. But this has to be done on your end, and I am not sure if this is a difficult change.
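(To illustrate the two options described above in plain PyTorch; the actual change would live in the QNN delegate runtime under backends/qualcomm, so this is only a sketch of the semantics, with made-up shapes:)

```python
import torch

k_cache = torch.zeros((1, 4, 12, 64))  # cache buffer owned by the delegate runtime

def update_by_copy(input_pos, k_val):
    # Option 1: run the functional op, then copy the result back into the buffer.
    out = torch.ops.aten.index_put(k_cache, [None, input_pos], k_val)
    k_cache.copy_(out)

def update_in_place(input_pos, k_val):
    # Option 2: make the update land directly in the original storage, i.e. the
    # output "aliases" the input buffer instead of producing a new tensor.
    torch.ops.aten.index_put_(k_cache, [None, input_pos], k_val)

update_in_place(torch.tensor([2]), torch.ones((1, 1, 12, 64)))
```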
So @cccclai and I were discussing. See if this is feasible:
Regarding these parts,
I was just chatting with Min, and it looks like a DMA buffer can be accessed from both QNN and the CPU. Can we use a DMA buffer for the KV cache to reduce the copy, and perhaps as a way to do in-place updates?
This is for @shewu-quic
Thank you all for the discussion. I think I understand your points. My initial idea yesterday was to see if we could use the memory planning pass during the AOT phase to ensure that the I/O of the mutable buffer uses the same address. This is because, for index_put, the I/O shapes are actually the same. For the QNN delegate at runtime, as long as the same address is given for the mutable buffer, it can be updated directly. I will write a simple test to check it. If the above is not feasible, I think, as discussed, we need to:
3 & 4. I think if the same memory address is used, there is no need for a copy-back operation. Of course, this memory address can be a shared buffer (DMA buf), which should also speed things up. Would this be in line with your thoughts?
I tested a simple model and attempted to hack the mutable buffer to set the same memory address. It seems to work functionally. Next, I will try to apply this to LLaMA. I hope it succeeds.

Simple model:

```python
import torch

class IndexPut(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer(
            "k_cache",
            torch.zeros((1, 4, 12, 64), dtype=torch.float32),
        )

    def forward(self, input_pos, k_val):
        k_out = torch.ops.aten.index_put_(self.k_cache, [None, input_pos], k_val)
        return k_out + torch.zeros((1, 4, 12, 64), dtype=torch.float32)
```

Sample inputs:

```python
sample_input = [
    (
        torch.tensor([2], dtype=torch.int32),
        torch.ones((1, 1, 12, 64), dtype=torch.float32),
    ),
    (
        torch.tensor([3], dtype=torch.int32),
        torch.ones((1, 1, 12, 64), dtype=torch.float32),
    ),
]
```

Results:
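(As a quick sanity check of the in-place behavior, reusing the IndexPut module and sample_input defined above; this is just an eager-mode illustration, not part of the original test:)

```python
m = IndexPut()
pos, val = sample_input[0]
m(pos, val)
# The registered buffer itself now holds k_val at the written position.
print(torch.allclose(m.k_cache[:, pos.long()], val))  # expect: True
```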
Thank you for sharing the thoughts and providing the simple examples. I'm wondering whether this small example uses the hack (using memory planning to point to the same address) or the list? Memory planning is also officially supported, so if you can share how you did this, we will have a better picture. We likely didn't test memory planning with in-place memory updates, but theoretically it can work. In the meanwhile, for the consuming-mutable-buffer solution, the buffers are actually part of .mutable_buffers instead of .params. Hopefully it's easier to handle if we go for the second path.
I hacked here to set the same address for the mutable buffer I/O.
Actually, I have no idea how to use the memory planning pass to set the same address for the mutable buffer I/O. Do you have any idea or example of how to use it?
Thanks for your information.
I would prefer that we take the second route you mentioned, where the QNN delegate consumes the mutable buffer. The issue with hacking around memory planning is that the mutable buffer is not memory-planned in a conventional sense. And any update to the buffer via index_put cannot alias: the output cannot point to the input buffer, because the input buffer is really a graph input, as you have shown in the figure above.
Got it, thank you for your detailed explanation. I will try the second method first.
- Delegated mutable buffer in AOT
- Manage mutable buffer at runtime
Hi, just wanted to communicate the potential impact of enabling batch prefill: it is possible that the KV cache will be explicit I/O. Can we also ensure we won't have the copy overhead when we enable batch prefill? Sorry, there are many moving pieces.
Do you mean the KV cache will be the inputs and outputs of the model, as in static llama?
Yeah, it will be similar. For prefill, when we output the KV cache, and for decode, when we pass in the KV cache, we still need to ensure it stays in HTP and there is no copy overhead.
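(For illustration, a minimal sketch of what "KV cache as explicit model I/O" could look like, in contrast to a mutable buffer; the module name and shapes are made up:)

```python
import torch

class StaticCacheLayer(torch.nn.Module):
    """Toy layer where the KV cache is passed in and returned, not stored as a buffer."""

    def forward(self, k_val, input_pos, k_cache):
        # Functional (out-of-place) update; nothing is mutated inside the graph.
        k_cache_out = torch.ops.aten.index_put(k_cache, [None, input_pos], k_val)
        out = k_cache_out.sum(dim=1)  # stand-in for the attention computation
        return out, k_cache_out

layer = StaticCacheLayer()
k_cache = torch.zeros((1, 4, 12, 64))
# The runner owns the cache and feeds the returned tensor back in on the next
# decode step; keeping that tensor in memory the HTP can access directly is
# what would avoid the extra copy.
out, k_cache = layer(torch.ones((1, 1, 12, 64)), torch.tensor([2]), k_cache)
```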
Got it. For decode, we still need to eliminate the copy overhead. But I have good news and bad news.
The bad news is that it is still worse than XNNPACK. Therefore, we are attempting to reduce the output size and have reverted to using static Llama for verification. We hope to provide an update on the situation today.
Summary:
Currently, we are running llama 3.2 1B instruct with a sequence length of 512 using llama_main’s profiling. The generated token rate is approximately 19 tokens per second. However, if we only consider the execution time of QNN, it is about 56 tokens per second.
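(A back-of-the-envelope breakdown of those two rates; this is just arithmetic on the numbers above, not a separate measurement:)

```python
total_ms_per_tok = 1000 / 19  # ~52.6 ms per generated token, end to end
qnn_ms_per_tok = 1000 / 56    # ~17.9 ms per token spent inside QNN execution
overhead_ms = total_ms_per_tok - qnn_ms_per_tok
print(f"~{overhead_ms:.1f} ms/token spent outside QNN")  # ~34.8 ms
```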
Issue description
As far as I know, to support the mutable buffer feature, to_executorch seems to insert copy operations for mutable buffers. These copy operations cause some overhead at runtime, according to my profiling results below.
However, I have observed that XNNPACK does not seem to have this issue. Is there any way to avoid these copy operations while ensuring the mutable buffer functions correctly?
Note that: