Slow optimizations on Dart binary #6042
This module has a function with over 100K locals and over 300K …

Possible optimization ideas:

- Can we replace the `unordered_map` keyed by index with a simple array / vector? (See the sketch below.)

Not easily, as we'd need to build an indexing for the … But meanwhile it seems the larger issue by far is that we can just skip param flowing entirely, as done in #6046. A remaining issue is that the sheer number of locals in the function makes us slow nonetheless, but running …
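For concreteness, the array/vector idea above would look roughly like this. This is a minimal sketch with made-up names, not Binaryen's actual `LocalGraph` code, and it assumes the keys are dense local indexes in `[0, numLocals)`:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Index = uint32_t;

// Per-local bookkeeping; the real contents don't matter for the comparison.
struct Info {
  std::vector<Index> getIds;
};

// Map-based storage: every access hashes the index.
struct MapStorage {
  std::unordered_map<Index, Info> info;
  Info& at(Index i) { return info[i]; }
};

// Vector-based storage: sized once, plain indexed access with no hashing.
struct VectorStorage {
  std::vector<Info> info;
  explicit VectorStorage(Index numLocals) : info(numLocals) {}
  Info& at(Index i) { return info[i]; }
};
```

The vector version pays one up-front allocation sized to the number of locals in exchange for unhashed O(1) access, which is what matters when a single function has over 100K locals.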
`allGets` was declared in a scope that kept it alive for all blocks, and at the end of the loop we clear the gets for a particular block. That's clumsy, and makes a follow-up harder, so this PR moves it to the natural place for it. (That is, it moves it to the scope that handles a particular block, and removes the manual clearing-out of the gets at the end of the loop iteration.) Optimizing compilers are smart enough to be efficient about stack allocations of objects inside loops anyhow (which I measured). Helps #6042.
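As a simplified illustration of that scoping change (a sketch only, not the actual pass code; `Get`, `Block`, and `flowBlocks` are stand-in names):

```cpp
#include <unordered_map>
#include <vector>

struct Get { unsigned index; };
struct Block { std::vector<Get*> gets; };

void flowBlocks(const std::vector<Block>& blocks) {
  // Before: allGets lived out here for all blocks, and needed a manual
  // allGets.clear() at the end of every loop iteration.
  for (const auto& block : blocks) {
    // After: declared in the scope that handles one particular block, so it
    // starts empty each iteration and there is nothing to clear by hand.
    // The compiler reuses the same stack slot across iterations.
    std::unordered_map<unsigned, std::vector<Get*>> allGets;
    for (Get* get : block.gets) {
      allGets[get->index].push_back(get);
    }
    // ... flow the gathered gets for this block ...
  }
}
```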
If a local index has no sets, then all gets of that index read from the entry block (a param's incoming value, or a zero for a non-param local). This is actually a common case, where a param has no other set, and so it is worth optimizing, which this PR does by avoiding any flowing operation at all for that index: we simply skip it and record the entry block as the source of information for such gets. The testcase from #6042 goes from 3 minutes to 2 seconds on `precompute-propagate` with this (!). But that testcase is rather special in that it is a huge function with many, many gets in it, so the overhead we remove is very noticeable there.
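A rough sketch of that fast path, assuming the number of sets per local index has already been counted (the names here are hypothetical, not the actual implementation):

```cpp
#include <cstddef>
#include <vector>

struct Get { size_t index; };

// Returns, for each get, whether its value can only come from the entry
// block (the incoming param value, or the implicit zero of a local).
std::vector<bool> entryOnlyGets(const std::vector<size_t>& numSetsForIndex,
                                const std::vector<Get>& gets) {
  std::vector<bool> fromEntry(gets.size(), false);
  for (size_t i = 0; i < gets.size(); i++) {
    if (numSetsForIndex[gets[i].index] == 0) {
      // No sets of this index anywhere in the function: skip the flowing
      // logic entirely and record the entry block as the source.
      fromEntry[i] = true;
    }
    // Otherwise the general, much more expensive flow analysis handles it.
  }
  return fromEntry;
}
```

Gets marked this way never enter the flow at all, which is where the cost comes from in a function with this many locals.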
We have made significant performance improvements since I filed this issue, and since I apparently decided I didn't need to attach the file in question, this is no longer actionable.
There is a Dart binary that takes extremely long to optimize with this command:
Most passes run in under a second, except that in the first -O3:
Overall, the first -O3 takes 836s. After that GUFA takes 31s and the second -O3 takes 410s. The -O1 takes 10s, spending most of its time in two executions of coalesce-locals that take 3s each.