Description
Our threading overhead seems significant. When I measure a fixed, purely computational workload, replacing the body of a pass like `precompute` to instead just do some silly work, and then measure with `time`, the `user` time is the same with `BINARYEN_CORES=1` (use 1 core) as when running normally with all cores. That makes sense, since `user` adds up the total actual work across all threads, and that total is the same either way. It also shows there isn't much synchronization overhead slowing us down in that case.
But that's not the typical case when running real passes: the multi-core `user` time can be much higher, see e.g. #2733 (comment), and I see similar things locally, with `user` being 2-3x larger when using 8 threads.
This may be a large speedup opportunity. One possibility is that we often have many tiny functions, and switching between them is costly. Or maybe there is contention on locks (see that last link, though this happens even after that PR, which should have removed that contention).
The thread-pool using code for running passes on functions is here:
Line 591 in dc5a503