Reduce threading overhead #2740

Open
Open
kripken opened this issue Apr 9, 2020 · 3 comments

@kripken
Member

kripken commented Apr 9, 2020

Our threading overhead seems significant. When I measure a fixed, purely computational workload (replacing the body of a pass like precompute with some busy work) and time it with time, the user time is the same with BINARYEN_CORES=1 (use 1 core) as when running normally on all cores. That makes sense, since the total actual work is summed into user time and is the same either way, and there isn't much synchronization overhead slowing us down.

But that's not the typical case when running real passes: there, the multi-core user time can be much higher, see e.g. #2733 (comment), and I see similar things locally, with user being 2-3x larger when using 8 threads.

This may be a large speedup opportunity. One possibility is that we often have many tiny functions, and switching between them is costly. Or maybe there is contention on locks (see that last link; this happens even after that PR, which should have gotten rid of that).

The code that uses the thread pool to run passes on functions is here:

// non-debug normal mode, run them in an optimal manner - for locality it is
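To make the contention concern concrete, here is a minimal C++ sketch (illustrative only, not Binaryen's actual implementation) of a pool where workers claim function indices from a shared atomic counter a chunk at a time, so that many tiny functions do not each pay a synchronization round-trip. The names (runPassOnFunction, runOnAllFunctions) and the chunking parameter are hypothetical:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for running one pass on one (tiny) function.
static uint64_t runPassOnFunction(uint64_t funcId) { return funcId * funcId; }

uint64_t runOnAllFunctions(uint64_t numFunctions, unsigned numThreads,
                           uint64_t chunkSize) {
  std::atomic<uint64_t> nextIndex{0};
  std::vector<uint64_t> partial(numThreads, 0);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back([&, t] {
      while (true) {
        // A single atomic op claims a whole chunk of functions; with
        // chunkSize == 1 every tiny function pays this cost separately.
        uint64_t begin = nextIndex.fetch_add(chunkSize);
        if (begin >= numFunctions) {
          break;
        }
        uint64_t end = std::min(begin + chunkSize, numFunctions);
        for (uint64_t i = begin; i < end; ++i) {
          partial[t] += runPassOnFunction(i);
        }
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  uint64_t total = 0;
  for (uint64_t p : partial) {
    total += p;
  }
  return total;
}
```

Either way the same total work is done (so user time is comparable), but larger chunks reduce how often workers touch the shared counter.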

@tlively
Member

tlively commented Apr 9, 2020

Some TODOs in this vein:

  • Store function parameter types separately rather than all together as a tuple
  • Investigate the performance impact of having Type contain a SmallVec rather than an index

As multivalue becomes common, we will also want to:

  • Investigate thread-local caching of commonly accessed types
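The thread-local caching idea in that last TODO could look roughly like the following C++ sketch (an assumed design, not Binaryen's implementation): tuple types are interned in a global table guarded by a mutex, and each thread keeps a small local map so repeated lookups of common types skip the lock entirely. TypeTuple, TypeId, and internType are hypothetical names:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

using TypeTuple = std::vector<uint32_t>; // stand-in for a list of value types
using TypeId = uint32_t;                 // stand-in for Type's index

TypeId internType(const TypeTuple& tuple) {
  // Global canonical table: every distinct tuple gets exactly one id.
  static std::mutex mutex;
  static std::map<TypeTuple, TypeId> global;
  // Per-thread cache: hits avoid the mutex entirely.
  thread_local std::map<TypeTuple, TypeId> local;

  if (auto it = local.find(tuple); it != local.end()) {
    return it->second; // fast path, no contention
  }

  std::lock_guard<std::mutex> lock(mutex);
  auto [it, inserted] = global.emplace(tuple, TypeId(global.size()));
  local.emplace(tuple, it->second);
  return it->second;
}
```

Under a workload where most functions share a handful of signatures, most lookups would hit the thread-local map and never contend on the global lock.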

@kripken
Member Author

kripken commented Apr 10, 2020

Measuring with perf stat after #2745, things look a lot better. user time is still higher than expected, but after investigating with perf I suspect that might be slightly misleading (or maybe perf is wrong...).

It does seem, though, that we gain almost nothing from using all the reported system cores versus half of them. My guess is hyperthreading doesn't really help us, since we are very CPU-bound (no I/O to wait on, and we are cache-friendly by having small data structures and running as many passes as possible on a single function before moving on to the next). But I'm not sure we can do anything about that.
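A simple way to act on that observation would be to cap the worker count at roughly the physical core count. This is a heuristic sketch, not Binaryen's code: std::thread::hardware_concurrency() reports logical cores, which double-counts hyperthreads, and halving it approximates physical cores on typical 2-way SMT machines (an assumption; it is wrong on non-SMT or 4-way SMT hardware):

```cpp
#include <algorithm>
#include <thread>

unsigned chooseWorkerCount(bool assumeHyperthreading) {
  unsigned logical = std::thread::hardware_concurrency();
  if (logical == 0) {
    return 1; // the standard allows 0 when the count is unknown
  }
  // Halve logical cores to approximate physical cores under 2-way SMT.
  unsigned chosen = assumeHyperthreading ? (logical + 1) / 2 : logical;
  return std::max(1u, chosen);
}
```

For CPU-bound, cache-friendly work like this, the halved pool may cost nothing in wall time while halving scheduling and contention overhead.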

@vapier

vapier commented Mar 14, 2022

fwiw, still seeing huge overhead with the v105 release

$ getconf _NPROCESSORS_ONLN
72

$ unset BINARYEN_CORES
$ time ./binaryen-version_105/bin/wasm-opt -O2 test -o test.wasm
real    0m21.541s
user    1m5.779s
sys     19m53.180s

$ export BINARYEN_CORES=1
$ time ./binaryen-version_105/bin/wasm-opt -O2 test -o test.wasm
real    0m8.487s
user    0m8.292s
sys     0m0.199s

$ du -h test test.wasm
2.4M    test
1.4M    test.wasm
