XNNPack configurations can be complicated to get right #8884
@jerryzh168 @kimishpatel on how we can hide the DuplicateDynamicQuantChainPass: I put up a diff in the past to find a way to hide this, but it ended up being way too hacky. At the end of the day, this should live in the pt2e quant flow when a user uses the XNNPACKQuantizer. As a backend implementer, when I implement the XNNPACK Quantizer, I also want the flexibility of choosing what the quantization pattern looks like; in this case I want the dynamic quant chain to be duplicated for all XNNPACK backends. This isn't something we can hide as a pass in the XNNPACK backend, because it needs to be called before partitioning: the XNNPACK partitioner depends on being able to see this pattern. Additionally, it shouldn't be a native ExecuTorch pass, because this pattern shouldn't be forced onto other quantizers and other backends.
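For context, this is the invocation users currently have to add by hand after convert_pt2e; a minimal sketch assuming the current import path, with `converted_gm` standing in for the module convert_pt2e returned:

```python
from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)

# Duplicate the dynamic quant chain (choose_qparams -> quantize ->
# dequantize) so each consumer sees its own copy, which the XNNPACK
# partitioner can then match during partitioning.
DuplicateDynamicQuantChainPass()(converted_gm)
```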
@mcr229 makes sense. I think we need to figure out where the right place is for this, and if that's the quantizer we can do that, but we need to be able to let users go e2e with XNNPack without us having to help them out. Along those lines, I also find it confusing sometimes when to use which partitioner and what configurations to set on top of them. It would be great to get that addressed too as part of this issue.
@tarun292 realistically XnnpackPartitioner() should work for every case; the other partitioners are legacy partitioners kept for backwards compatibility. So the workaround in #8727 (comment) is likely a bug on our side that happened to work.
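For reference, a minimal lowering sketch using only the recommended partitioner; import paths are assumed from a recent ExecuTorch, and `model` / `example_inputs` are placeholders:

```python
import torch

from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)
from executorch.exir import to_edge

# Export, convert to Edge dialect, and delegate everything XNNPACK can
# consume with the single recommended partitioner; the legacy variants
# should not be needed in new code.
edge = to_edge(torch.export.export(model, example_inputs))
edge = edge.to_backend(XnnpackPartitioner())
executorch_program = edge.to_executorch()
```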
Does migrating the XnnpackQuantizer into ET help us with this at all? Or is the logic buried deeper in the stack?
convert_pt2e does not take the quantizer as input, but if we changed that, then the quantizer could specify passes to be run post-convert.
@kimishpatel so we modify the base quantizer class to support specifying passes inside it?
Yes, and make convert_pt2e take in the quantizer. I will have to think more on this, because convert is not really meant for things of this sort, but if we were to change the API so that the quantizer is the one doing prepare and convert, that might be another option. Honestly, though, these are disjoint things, and ideally I would still like convert not to take in any of these.

The problem here in particular is not that quantization should be running those passes; it is really the partitioner that should be running them, at best. The second option is the quantizer. At the heart of this, we are trying to find the place where these passes should run. I actually feel the least bad option is the partitioner, as it is closest to where this pass is needed. Our problem is that the partitioner is not supposed to mutate the graph, but I think it might be OK to revisit this. The other way to think about it is to say that once the graph is partitioned and handed over to preprocess, preprocess can apply transformations to that graph without touching the original graph. It is then the responsibility of preprocess to make sure any such mutations still preserve the semantics of the original graph.
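To make the quantizer option concrete, here is a hypothetical sketch of what it could look like; none of these hooks exist today, and `get_post_convert_passes` / `convert_with_quantizer` are invented names for illustration:

```python
# Hypothetical sketch only: the hook below does not exist in
# torch.ao.quantization today; names are invented for illustration.
from typing import Callable, List

from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)


class QuantizerWithPasses:
    """Stand-in for the base Quantizer, extended so each quantizer can
    declare graph passes that must run after convert_pt2e."""

    def get_post_convert_passes(self) -> List[Callable]:
        return []


class SketchXNNPACKQuantizer(QuantizerWithPasses):
    def get_post_convert_passes(self) -> List[Callable]:
        # The XNNPACK-specific rewrite travels with the quantizer
        # instead of being applied by hand by the user.
        return [DuplicateDynamicQuantChainPass()]


def convert_with_quantizer(gm, quantizer: QuantizerWithPasses):
    # What convert_pt2e(gm, quantizer=...) might do if it accepted the
    # quantizer: run the usual convert, then the quantizer's passes.
    for graph_pass in quantizer.get_post_convert_passes():
        graph_pass(gm)
    return gm
```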
The former is a hack for a problem where we can't free unpacked weights as we pack. Without it we will segfault for LLMs, as we end up with 2X the weights in RAM. This is something we are planning to look at, but it's P1 at best TBH. We can at least document this a bit better. FWIW, in an ideal world there would be only one partitioner and no graph breaks.
@kimishpatel the problem with running this in the partitioner is that the partitioner already needs to see this pattern in order to partition successfully, according to @mcr229, so running it in the partitioner might be hard. @digantdesai if the actual fix for this is hard, then let's add this to the documentation as you've suggested. I think we have a backend that performs really well once users get the configuration right, and we should make getting there as low-effort as possible.
@digantdesai if we moved dynamic quantization of activations outside of XNNPACK, could that also reduce memory pressure? Again, it doesn't feel like the right solution, but just a thought.
@mcr229 is the issue with tagging, or is tagging itself broken due to not matching the pattern? If so, @tarun292, can I suggest doing such graph mutation within the partitioner, if the partitioner realizes there is something it can consume? In this case the mutations must be consumed.
@mcr229 can confirm, but I think this is the issue, hence it's hard to do the mutation in the partitioner itself.
So I think it might be OK to partition the pattern as is and then apply the pass within XNNPACK. The problem with that is we lose the ability to do per-op mode, which essentially keeps every dq linear in its own partition (because we would have to partition together dq linear patterns that share the same dynamic quant chain). From what Digant was mentioning, this causes issues with memory, because XNNPACK doesn't have a mechanism to free unpacked weights right after we pack, meaning at worst we have all unpacked + packed weights of a single partition in memory at once. That said, while the partition size increases, I don't think it would cause all the dq linears to land in the same partition; I imagine at most a couple of dq linears would share one. So I don't think this would significantly regress memory, but I don't know how large the regression would be.
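For reference, per-op mode is selected at partitioner construction time; a sketch assuming the `per_op_mode` flag on `XnnpackPartitioner` (the exact knob may differ across versions):

```python
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

# Per-op mode keeps each lowerable op (e.g. each dq linear) in its own
# partition, trading fusion opportunities for lower peak memory while
# XNNPACK holds both unpacked and packed weights of a partition.
partitioner = XnnpackPartitioner(per_op_mode=True)
```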
In the case of LLMs, the FFN layers seem to have a couple of linears that take the same input. These are large, but it won't double the peak memory. Worth trying out. Good point, though, on losing the ability to apply the single-op partitioner. My concern with moving this into quantization is that it couples different concerns together, and in the long term that doesn't turn out well.
It can be a little confusing to get XNNPack quantization + delegation configuration right. As seen in this issue, for example, the user had to go through quite a bit of iteration to get to the right combination (a sketch of the resulting flow follows the list below):
1.) They had to figure out that they needed to add DuplicateDynamicQuantChainPass.
2.) Then they had to do this to get delegation working: #8727 (comment)
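For concreteness, the combination they eventually landed on looks roughly like the sketch below. Import paths assume a recent torch/ExecuTorch (the quantizer import would change if XNNPACKQuantizer migrates into ET), and `model` / `example_inputs` are placeholders:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)
from executorch.exir import to_edge

quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config(is_dynamic=True)
)

m = torch.export.export_for_training(model, example_inputs).module()
m = prepare_pt2e(m, quantizer)
m(*example_inputs)  # calibrate
m = convert_pt2e(m)

# 1.) the pass users currently have to discover on their own:
DuplicateDynamicQuantChainPass()(m)

# 2.) the partitioner choice users currently have to get right:
edge = to_edge(torch.export.export(m, example_inputs))
edge = edge.to_backend(XnnpackPartitioner())
```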
I think for cases like 1) we should ideally not need users to explicitly add this pass.
For cases like 2) it can be confusing to users whether to call XnnpackDynamicallyQuantizedPartitioner or XnnpackPartitioner etc. I'm not really sure what the best solution is, so I'll leave it to @mcr229 @digantdesai.
cc @digantdesai @mcr229 @cbilgin @mergennachin @byjlw