XNNPack configurations can be complicated to get right #8884

Open · tarun292 opened this issue Mar 3, 2025 · 14 comments
Labels

module: user experience — Issues related to reducing friction for users
module: xnnpack — Issues related to xnnpack delegation and the code under backends/xnnpack/
triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@tarun292 (Contributor) commented Mar 3, 2025

It can be a little confusing to get the XNNPack quantization + delegation configuration right. As seen in this issue, for example, the user had to go through quite a bit of iteration to get to the right combination:
1. They had to figure out that they needed to add DuplicateDynamicQuantChainPass.
2. Then they had to do this to get delegation working: #8727 (comment)

I think for cases like 1) we ideally shouldn't need users to explicitly add this pass.
For cases like 2) it can be confusing to users whether to call XnnpackDynamicallyQuantizedPartitioner or XnnpackPartitioner etc. I'm not really sure what the best solution for this is, but I'll leave it to @mcr229 @digantdesai.

cc @digantdesai @mcr229 @cbilgin @mergennachin @byjlw
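
For context, here is a minimal sketch of the end-to-end flow a user currently has to discover on their own. The import paths and the exact API shapes are my best understanding of the repos at this time — treat them as assumptions and verify against the source tree:

```python
import torch
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# Assumed import paths -- check against the repo before relying on them.
from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

model = torch.nn.Sequential(torch.nn.Linear(8, 8)).eval()
example_inputs = (torch.randn(1, 8),)

# 1. pt2e quantization with dynamic activation quantization.
quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config(is_dynamic=True)
)
m = export(model, example_inputs).module()
m = prepare_pt2e(m, quantizer)
m(*example_inputs)  # calibration
m = convert_pt2e(m)

# 2. The step users currently have to discover themselves (case 1 above).
DuplicateDynamicQuantChainPass()(m)

# 3. Re-export and delegate to XNNPACK.
edge = to_edge(export(m, example_inputs))
edge = edge.to_backend(XnnpackPartitioner())
```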

@mcr229 (Contributor) commented Mar 3, 2025

@jerryzh168 @kimishpatel on how we can hide the DuplicateDynamicQuantChainPass.

I put up a diff in the past to find a way to hide this, but it ended up being way too hacky. At the end of the day, this should live in the pt2e quant flow when a user uses the XNNPACKQuantizer. As a backend implementer, when I implement the XNNPACK Quantizer, I also want the flexibility of choosing what the quantization pattern looks like; in this case I want the dynamic quant chain to be duplicated for all XNNPACK backends.

This isn't something we can hide as a pass in the XNNPACK backend, because it needs to run before partitioning: the XNNPACK Partitioner depends on being able to see this pattern. Additionally, this shouldn't be a native ExecuTorch pass, because the pattern shouldn't be forced onto other quantizers and other backends.
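
To make the duplication idea concrete, here is a generic torch.fx sketch — not the actual DuplicateDynamicQuantChainPass, whose pattern matching is more specific — that gives each extra consumer of a shared node its own private copy:

```python
import torch
import torch.fx as fx

class TwoLinears(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(8, 8)
        self.b = torch.nn.Linear(8, 8)

    def forward(self, x):
        y = torch.relu(x)  # stand-in for the shared dynamic quant chain
        return self.a(y) + self.b(y)

def duplicate_shared_nodes(gm: fx.GraphModule) -> fx.GraphModule:
    """Give each extra consumer of a multi-user node its own copy,
    mimicking what a duplicate-the-quant-chain style pass does."""
    for node in list(gm.graph.nodes):
        if node.op != "call_function":
            continue
        users = list(node.users)
        for user in users[1:]:  # the first user keeps the original node
            with gm.graph.inserting_before(user):
                clone = gm.graph.node_copy(node)
            user.replace_input_with(node, clone)
    gm.recompile()
    return gm

gm = duplicate_shared_nodes(fx.symbolic_trace(TwoLinears()))
print(gm.graph)  # relu now appears once per consumer
```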

@tarun292 (Contributor, Author) commented Mar 3, 2025

@mcr229 makes sense. I think we need to figure out where the right place for this is, and if that's the quantizer we can do that, but we need to be able to let users go e2e with XNNPack without us having to help them out.

Along those lines, I also sometimes find it confusing when to use which partitioner and which configurations to apply on top of them. It would be great to get that addressed too as part of this issue.

@mcr229 (Contributor) commented Mar 3, 2025

@tarun292 realistically XnnpackPartitioner() should work for every case; the other partitioners are legacy partitioners kept for backwards compatibility. So the behavior in #8727 (comment) is likely a bug on our side that happened to work.

@GregoryComer (Member) commented:

Does migrating the XnnpackQuantizer into ET help us with this at all? Or is the logic buried deeper in the stack?

@kimishpatel (Contributor) commented:

convert_pt2e does not take a quantizer as input, but if it did, the quantizer could specify passes to be run post-convert.
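
A hedged sketch of that idea — purely hypothetical, since convert_pt2e takes no quantizer today, and post_convert_passes is an invented hook:

```python
from torch.ao.quantization.quantize_pt2e import convert_pt2e
from torch.fx import GraphModule

def convert_pt2e_with_quantizer(model: GraphModule, quantizer) -> GraphModule:
    converted = convert_pt2e(model)
    # Hypothetical hook: the quantizer owns a list of post-convert passes,
    # e.g. XNNPACKQuantizer could register DuplicateDynamicQuantChainPass here.
    for p in getattr(quantizer, "post_convert_passes", []):
        p(converted)
    return converted
```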

@tarun292 (Contributor, Author) commented Mar 4, 2025

@kimishpatel so we modify the base quantizer class to support specifying passes inside it?

@kimishpatel (Contributor) commented:

> @kimishpatel so we modify the base quantizer class to support specifying passes inside it?

Yes, and make convert_pt2e take in the quantizer. I will have to think more on this, because convert is not really meant for things of this sort, but if we were to change the API so that the quantizer is the one doing prepare and convert, then that might be another option.

But honestly these are disjoint things. Ideally I would still like convert not to take in any of these. The problem here in particular is not that quantization should be running those passes; it is really the partitioner that should be running them, at best. The second option is the quantizer.

At the heart of this, we are trying to find the place where these passes should be run. I actually feel the least bad option is the partitioner, as it is closest to where this pass is needed. Our problem is that the partitioner is not supposed to mutate the graph, but I think it might be ok to revisit this. The other way to think about it is to say that once the graph is partitioned and handed over to preprocess, preprocess can apply transformations to that graph without actually touching the original graph. It is then the responsibility of preprocess to make sure any such mutations still preserve the semantics of the original graph.
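
A minimal sketch of that last option, assuming a backend's preprocess receives the partitioned submodule — the names here are illustrative, not the actual backend API:

```python
import copy
from torch.fx import GraphModule

def preprocess(partitioned_submodule: GraphModule) -> GraphModule:
    # Mutate a private copy so the original partitioned graph is untouched;
    # preprocess is then on the hook for keeping the semantics equivalent.
    gm = copy.deepcopy(partitioned_submodule)
    # ... apply backend-local transforms here, e.g. duplicating the
    # dynamic quant chain, before serializing for the backend ...
    gm.recompile()
    return gm
```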

@digantdesai (Contributor) commented:

> it can be confusing to users whether to call XnnpackDynamicallyQuantizedPartitioner or XnnpackPartitioner etc.

The former is a hack for a problem where we can't free unpacked weights as we pack. Without it we would segfault for LLMs, as we end up with 2X the weights in RAM.

This is something we are planning to look at, but it's P1 at best TBH. We can at least document this a bit better.

FWIW, in an ideal world there would be only one partitioner and no graph breaks.
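
To put rough numbers on the 2X concern — the figures below are hypothetical, just to illustrate the failure mode:

```python
# Hypothetical back-of-envelope: a 1B-parameter model with 8-bit weights.
params = 1_000_000_000
unpacked_bytes = params * 1           # int8 weights as exported
packed_bytes = params * 1             # XNNPACK's packed layout, similar size
peak = unpacked_bytes + packed_bytes  # both alive if we can't free as we pack
print(f"~{peak / 2**30:.1f} GiB peak vs ~{unpacked_bytes / 2**30:.1f} GiB steady state")
```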

@tarun292 (Contributor, Author) commented Mar 4, 2025

@kimishpatel the problem with running this in the partitioner is that the partitioner already needs to see this pattern in order to successfully partition it (according to @mcr229), so running this in the partitioner might be hard.

@digantdesai if the actual fix for this is hard, then let's add this to the documentation as you've suggested. I think we have a backend that performs really well when users get the configuration right, and we should make that as low-effort as possible for them.

@kimishpatel (Contributor) commented:

> it can be confusing to users whether to call XnnpackDynamicallyQuantizedPartitioner or XnnpackPartitioner etc.
>
> The former is a hack for a problem where we can't free unpacked weights as we pack. Without it we would segfault for LLMs, as we end up with 2X the weights in RAM.
>
> This is something we are planning to look at, but it's P1 at best TBH. We can at least document this a bit better.
>
> FWIW, in an ideal world there would be only one partitioner and no graph breaks.

@digantdesai if we moved dynamic quantization of activations outside of XNNPACK, could that also reduce memory pressure? Again, it doesn't feel like the right solution, just a thought.

@kimishpatel (Contributor) commented:

> @kimishpatel the problem with running this in the partitioner is that the partitioner already needs to see this pattern in order to successfully partition it (according to @mcr229), so running this in the partitioner might be hard.

@mcr229 is the issue with tagging, or is tagging itself broken due to not matching the pattern? If so, @tarun292, can I suggest doing such graph mutation within the partitioner, if the partitioner realizes there is something it can consume? In that case the mutations must be consumed.

@tarun292 (Contributor, Author) commented Mar 4, 2025

> tagging itself is broken due to not matching the pattern

@mcr229 can confirm, but I think this is the issue, hence it's hard to do the mutation in the partitioner itself.

@mcr229 (Contributor) commented Mar 4, 2025

So I think it might be ok to partition the pattern as is and then apply the pass within XNNPACK. I think the problem with that is that we lose the ability to do per-op mode, which essentially keeps every dq linear in its own partition (this is because we would have to partition together dq linear patterns which share the same dynamic quant chain). From what Digant was mentioning, this causes issues with memory, because XNNPACK doesn't have a mechanism to free unpacked weights right after we pack, meaning at worst we have all the unpacked + packed weights of a single partition in memory at once.

That said, while the partition size increases, I don't think it would cause all the dq linears to land in the same partition. I imagine at most a couple of dq linears will lie within the same partition, so I don't think this would significantly regress memory, but I don't know how much of a regression it would cause.
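
For reference, my understanding of how per-op mode is requested today — the per_op_mode flag and the ConfigPrecisionType import path are assumptions based on the current repo layout, so verify before use:

```python
# Assumed import paths -- verify against backends/xnnpack/ before use.
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.backends.xnnpack.partition.config.xnnpack_config import (
    ConfigPrecisionType,
)

# Per-op mode: each dq linear lands in its own partition, so at most one
# op's unpacked + packed weights are alive at a time during packing.
partitioner = XnnpackPartitioner(
    config_precisions=ConfigPrecisionType.DYNAMIC_QUANT,
    per_op_mode=True,
)
edge = edge.to_backend(partitioner)  # `edge` from to_edge(...) as earlier
```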

@kimishpatel (Contributor) commented:

> So I think it might be ok to partition the pattern as is and then apply the pass within XNNPACK. I think the problem with that is that we lose the ability to do per-op mode, which essentially keeps every dq linear in its own partition (this is because we would have to partition together dq linear patterns which share the same dynamic quant chain). From what Digant was mentioning, this causes issues with memory, because XNNPACK doesn't have a mechanism to free unpacked weights right after we pack, meaning at worst we have all the unpacked + packed weights of a single partition in memory at once.
>
> That said, while the partition size increases, I don't think it would cause all the dq linears to land in the same partition. I imagine at most a couple of dq linears will lie within the same partition, so I don't think this would significantly regress memory, but I don't know how much of a regression it would cause.

In the case of LLMs, the FFN layers seem to have a couple of linears that take the same input. These are large, but it won't double the peak memory. Worth trying it out.

But good point on the loss of the ability to apply the single-op partitioner. My concern with moving this to quantization is that it couples different concerns together, and in the long term that doesn't turn out well.
