XNNPack configurations can be complicated to get right #8884

Open · tarun292 opened this issue Mar 3, 2025 · 14 comments
Labels

module: user experience — Issues related to reducing friction for users
module: xnnpack — Issues related to xnnpack delegation and the code under backends/xnnpack/
triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@tarun292 (Contributor) commented Mar 3, 2025

It can be a little confusing to get the XNNPack quantization + delegation configuration right. As seen in this issue, for example, the user had to go through quite a bit of iteration to get to the right combination:
1. They had to figure out that they needed to add DuplicateDynamicQuantChainPass.
2. Then they had to do this to get delegation working: #8727 (comment)

I think for cases like 1) we ideally shouldn't need users to explicitly add this pass.
For cases like 2) it can be confusing to users whether to call XnnpackDynamicallyQuantizedPartitioner or XnnpackPartitioner etc. I'm not really sure what the best solution for this is, but I'll leave it to @mcr229 @digantdesai.

cc @digantdesai @mcr229 @cbilgin @mergennachin @byjlw
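
For context, here is a minimal sketch of the end-to-end flow a user currently has to discover on their own. The import paths and the exact API shapes are my best understanding of the repos at this time — treat them as assumptions and verify against the source tree:

```python
import torch
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# Assumed import paths -- check against the repo before relying on them.
from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

model = torch.nn.Sequential(torch.nn.Linear(8, 8)).eval()
example_inputs = (torch.randn(1, 8),)

# 1. pt2e quantization with dynamic activation quantization.
quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config(is_dynamic=True)
)
m = export(model, example_inputs).module()
m = prepare_pt2e(m, quantizer)
m(*example_inputs)  # calibration
m = convert_pt2e(m)

# 2. The step users currently have to discover themselves (case 1 above).
DuplicateDynamicQuantChainPass()(m)

# 3. Re-export and delegate to XNNPACK.
edge = to_edge(export(m, example_inputs))
edge = edge.to_backend(XnnpackPartitioner())
```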

@mcr229 (Contributor) commented Mar 3, 2025

@jerryzh168 @kimishpatel on how we can hide the DuplicateDynamicQuantChainPass.

I put up a diff in the past to find a way to hide this, but it ended up being way too hacky. At the end of the day, this should live in the pt2e quant flow when a user uses the XNNPACKQuantizer. As a backend implementer, when I implement the XNNPACK Quantizer, I also want the flexibility of choosing what the quantization pattern looks like; in this case I want the dynamic quant chain to be duplicated for all XNNPACK backends.

This isn't something we can hide as a pass in the XNNPACK backend, because it needs to run before partitioning: the XNNPACK Partitioner depends on being able to see this pattern. Additionally, this shouldn't be a native ExecuTorch pass, because the pattern shouldn't be forced onto other quantizers and other backends.
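
To make the duplication idea concrete, here is a generic torch.fx sketch — not the actual DuplicateDynamicQuantChainPass, whose pattern matching is more specific — that gives each extra consumer of a shared node its own private copy:

```python
import torch
import torch.fx as fx

class TwoLinears(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(8, 8)
        self.b = torch.nn.Linear(8, 8)

    def forward(self, x):
        y = torch.relu(x)  # stand-in for the shared dynamic quant chain
        return self.a(y) + self.b(y)

def duplicate_shared_nodes(gm: fx.GraphModule) -> fx.GraphModule:
    """Give each extra consumer of a multi-user node its own copy,
    mimicking what a duplicate-the-quant-chain style pass does."""
    for node in list(gm.graph.nodes):
        if node.op != "call_function":
            continue
        users = list(node.users)
        for user in users[1:]:  # the first user keeps the original node
            with gm.graph.inserting_before(user):
                clone = gm.graph.node_copy(node)
            user.replace_input_with(node, clone)
    gm.recompile()
    return gm

gm = duplicate_shared_nodes(fx.symbolic_trace(TwoLinears()))
print(gm.graph)  # relu now appears once per consumer
```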

@tarun292 (Contributor, Author) commented Mar 3, 2025

@mcr229 makes sense. I think we need to figure out where the right place for this is, and if that's the quantizer we can do that, but we need to be able to let users go e2e with XNNPack without us having to help them out.

Along those lines, I also sometimes find it confusing when to use which partitioner and which configurations to apply on top of them. It would be great to get that addressed too as part of this issue.

@mcr229 (Contributor) commented Mar 3, 2025

@tarun292 realistically XnnpackPartitioner() should work for every case; the other partitioners are legacy partitioners kept for backwards compatibility. So the behavior in #8727 (comment) is likely a bug on our side that happened to work.

@GregoryComer (Member) commented:

Does migrating the XnnpackQuantizer into ET help us with this at all? Or is the logic buried deeper in the stack?

@kimishpatel (Contributor) commented:

convert_pt2e does not take a quantizer as input, but if it did, the quantizer could specify passes to be run post-convert.
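
A hedged sketch of that idea — purely hypothetical, since convert_pt2e takes no quantizer today, and post_convert_passes is an invented hook:

```python
from torch.ao.quantization.quantize_pt2e import convert_pt2e
from torch.fx import GraphModule

def convert_pt2e_with_quantizer(model: GraphModule, quantizer) -> GraphModule:
    converted = convert_pt2e(model)
    # Hypothetical hook: the quantizer owns a list of post-convert passes,
    # e.g. XNNPACKQuantizer could register DuplicateDynamicQuantChainPass here.
    for p in getattr(quantizer, "post_convert_passes", []):
        p(converted)
    return converted
```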

@tarun292 (Contributor, Author) commented Mar 4, 2025

@kimishpatel so we modify the base quantizer class to support specifying passes inside it?

@kimishpatel (Contributor) commented:

> @kimishpatel so we modify the base quantizer class to support specifying passes inside it?

Yes, and make convert_pt2e take in the quantizer. I will have to think more on this, because convert is not really meant for things of this sort, but if we were to change the API so that the quantizer is the one doing prepare and convert, then that might be another option.

But honestly these are disjoint things. Ideally I would still like convert not to take in any of these. The problem here in particular is not that quantization should be running those passes; it is really the partitioner that should be running them, at best. The second option is the quantizer.

At the heart of this, we are trying to find the place where these passes should be run. I actually feel the least bad option is the partitioner, as it is closest to where this pass is needed. Our problem is that the partitioner is not supposed to mutate the graph, but I think it might be ok to revisit this. The other way to think about it is to say that once the graph is partitioned and handed over to preprocess, preprocess can apply transformations to that graph without actually touching the original graph. It is then the responsibility of preprocess to make sure any such mutations still preserve the semantics of the original graph.
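
A minimal sketch of that last option, assuming a backend's preprocess receives the partitioned submodule — the names here are illustrative, not the actual backend API:

```python
import copy
from torch.fx import GraphModule

def preprocess(partitioned_submodule: GraphModule) -> GraphModule:
    # Mutate a private copy so the original partitioned graph is untouched;
    # preprocess is then on the hook for keeping the semantics equivalent.
    gm = copy.deepcopy(partitioned_submodule)
    # ... apply backend-local transforms here, e.g. duplicating the
    # dynamic quant chain, before serializing for the backend ...
    gm.recompile()
    return gm
```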

@digantdesai (Contributor) commented:

> it can be confusing to users whether to call XnnpackDynamicallyQuantizedPartitioner or XnnpackPartitioner etc.

The former is a hack for a problem where we can't free unpacked weights as we pack. Without it we would segfault for LLMs, as we end up with 2X the weights in RAM.

This is something we are planning to look at, but it's P1 at best TBH. We can at least document this a bit better.

FWIW, in an ideal world there would be only one partitioner and no graph breaks.
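
To put rough numbers on the 2X concern — the figures below are hypothetical, just to illustrate the failure mode:

```python
# Hypothetical back-of-envelope: a 1B-parameter model with 8-bit weights.
params = 1_000_000_000
unpacked_bytes = params * 1           # int8 weights as exported
packed_bytes = params * 1             # XNNPACK's packed layout, similar size
peak = unpacked_bytes + packed_bytes  # both alive if we can't free as we pack
print(f"~{peak / 2**30:.1f} GiB peak vs ~{unpacked_bytes / 2**30:.1f} GiB steady state")
```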

@tarun292 (Contributor, Author) commented Mar 4, 2025

@kimishpatel the problem with running this in the partitioner is that the partitioner already needs to see this pattern in order to successfully partition it (according to @mcr229), so running this in the partitioner might be hard.

@digantdesai if the actual fix for this is hard, then let's add this to the documentation as you've suggested. I think we have a backend that performs really well when users get the configuration right, and we should make that as low-effort as possible for them.

@kimishpatel (Contributor) commented:

> it can be confusing to users whether to call XnnpackDynamicallyQuantizedPartitioner or XnnpackPartitioner etc.
>
> The former is a hack for a problem where we can't free unpacked weights as we pack. Without it we would segfault for LLMs, as we end up with 2X the weights in RAM.
>
> This is something we are planning to look at, but it's P1 at best TBH. We can at least document this a bit better.
>
> FWIW, in an ideal world there would be only one partitioner and no graph breaks.

@digantdesai if we moved dynamic quantization of activations outside of XNNPACK, could that also reduce memory pressure? Again, it doesn't feel like the right solution, just a thought.

@kimishpatel (Contributor) commented:

> @kimishpatel the problem with running this in the partitioner is that the partitioner already needs to see this pattern in order to successfully partition it (according to @mcr229), so running this in the partitioner might be hard.

@mcr229 is the issue with tagging, or is tagging itself broken due to not matching the pattern? If so, @tarun292, can I suggest doing such graph mutation within the partitioner, if the partitioner realizes there is something it can consume? In that case the mutations must be consumed.

@tarun292 (Contributor, Author) commented Mar 4, 2025

> tagging itself is broken due to not matching the pattern

@mcr229 can confirm, but I think this is the issue, hence it's hard to do the mutation in the partitioner itself.

@mcr229 (Contributor) commented Mar 4, 2025

So I think it might be ok to partition the pattern as is and then apply the pass within XNNPACK. I think the problem with that is that we lose the ability to do per-op mode, which essentially keeps every dq linear in its own partition (this is because we would have to partition together dq linear patterns which share the same dynamic quant chain). From what Digant was mentioning, this causes issues with memory, because XNNPACK doesn't have a mechanism to free unpacked weights right after we pack, meaning at worst we have all the unpacked + packed weights of a single partition in memory at once.

That said, while the partition size increases, I don't think it would cause all the dq linears to land in the same partition. I imagine at most a couple of dq linears will lie within the same partition, so I don't think this would significantly regress memory, but I don't know how much of a regression it would cause.
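
For reference, my understanding of how per-op mode is requested today — the per_op_mode flag and the ConfigPrecisionType import path are assumptions based on the current repo layout, so verify before use:

```python
# Assumed import paths -- verify against backends/xnnpack/ before use.
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.backends.xnnpack.partition.config.xnnpack_config import (
    ConfigPrecisionType,
)

# Per-op mode: each dq linear lands in its own partition, so at most one
# op's unpacked + packed weights are alive at a time during packing.
partitioner = XnnpackPartitioner(
    config_precisions=ConfigPrecisionType.DYNAMIC_QUANT,
    per_op_mode=True,
)
edge = edge.to_backend(partitioner)  # `edge` from to_edge(...) as earlier
```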

@kimishpatel (Contributor) commented:

> So I think it might be ok to partition the pattern as is and then apply the pass within XNNPACK. I think the problem with that is that we lose the ability to do per-op mode, which essentially keeps every dq linear in its own partition (this is because we would have to partition together dq linear patterns which share the same dynamic quant chain). From what Digant was mentioning, this causes issues with memory, because XNNPACK doesn't have a mechanism to free unpacked weights right after we pack, meaning at worst we have all the unpacked + packed weights of a single partition in memory at once.
>
> That said, while the partition size increases, I don't think it would cause all the dq linears to land in the same partition. I imagine at most a couple of dq linears will lie within the same partition, so I don't think this would significantly regress memory, but I don't know how much of a regression it would cause.

In the case of LLMs, the FFN layers seem to have a couple of linears that take the same input. These are large, but it won't double the peak memory. Worth trying it out.

But good point on the loss of the ability to apply the single-op partitioner. My concern with moving this to quantization is that it couples different concerns together, and in the long term that doesn't turn out well.
