
Slow optimizations on Dart binary #6042


Closed
tlively opened this issue Oct 23, 2023 · 4 comments

Comments

@tlively
Member

tlively commented Oct 23, 2023

There is a Dart binary that takes extremely long to optimize with this command:

wasm-opt -g -all --closed-world -tnh --type-unfinalizing -O3 --type-ssa --gufa -O3 --type-merging -O1 --type-finalizing

Most passes run in under a second, with these exceptions in the first -O3:

  • ssa-nomerge takes 90s
  • precompute-propagate takes 104s
  • heap2local takes 55s
  • local-subtyping takes 56s
  • code-folding takes 77s
  • dae-optimizing takes 361s
  • optimize-stack-ir takes 36s

Overall, the first -O3 takes 836s. After that GUFA takes 31s and the second -O3 takes 410s. The -O1 takes 10s, spending most of its time in two executions of coalesce-locals that take 3s each.

@kripken
Member

kripken commented Oct 23, 2023

code-folding is odd as it doesn't use LocalGraph, but all the others do and are explained by that.

This module has a function with over 100 K locals and over 300 K local.gets, so LocalGraph has a lot of work to do.

Possible optimization ideas:

  • Running -O1 first removes all those locals, but leaves the param and the local.gets, and it is still slow.
  • Noting there are no local.sets for the param lets us skip all the block flowing.
  • Even with that, however, we do end up doing over 300 K inserts into an unordered_map<Index, ..> which is apparently quite slow.

@tlively
Member Author

tlively commented Oct 24, 2023

Can we replace the index-keyed unordered_map with a simple array / vector?

@kripken
Member

kripken commented Oct 24, 2023

Not easily, as we'd need to build an index over the LocalGet* etc. pointers.

But, meanwhile it seems the larger issue by far is that we can just skip param flowing entirely, as done in #6046.

A remaining issue is that the sheer number of locals in the function makes us slow nonetheless, but running -O1 first solves that. In theory perhaps -O3 should run those passes before the first LocalGraph-using pass, but I worry that would add significant work that usually won't help.

kripken added a commit that referenced this issue Oct 24, 2023
allGets was declared in a scope that kept it alive for all blocks, and at
the end of the loop we clear the gets for a particular block. That's clumsy,
and makes a followup harder, so this PR moves it to the natural place for it.
(That is, it moves it to the scope that handles a particular block, and removes
the manual clearing-out of the get at the end of the loop iteration.) Optimizing
compilers are smart enough to be efficient about stack allocations
of objects inside loops anyhow (which I measured).

Helps #6042.
kripken added a commit that referenced this issue Oct 24, 2023
If a local index has no sets, then all gets of that index read from the entry block
(a param, or a zero for a local). This is actually a common case, where a param
has no other set, and so it is worth optimizing, which this PR does by avoiding
any flowing operation at all for that index: we just skip and write the entry block
as the source of information for such gets.

#6042 on precompute-propagate goes from 3 minutes to 2 seconds with this (!).
But that testcase is rather special in that it is a huge function with many, many
gets in it, so the overhead we remove is very noticeable there.
radekdoulik pushed two commits to dotnet/binaryen that referenced this issue Jul 12, 2024, mirroring the two commits above.
@tlively
Member Author

tlively commented Jan 11, 2025

We have made significant performance improvements since I filed this issue, and since I apparently decided I didn't need to attach the file in question, this is no longer actionable.

@tlively tlively closed this as completed Jan 11, 2025