Consider linking with mimalloc in release executables? #5561
Very interesting! Overall this makes me think that maybe the issues we've seen with multithreading overhead are due to malloc contention between threads, like these: emscripten-core/emscripten#15727 #2740 It might be good to investigate two things here:
|
Looks like Google publishes a heap profiler that might be useful for this: https://gperftools.github.io/gperftools/heapprofile.html |
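For reference, a minimal sketch of using that heap profiler (the library path, file names, and input file below are assumptions; gperftools must be installed): preload tcmalloc, point HEAPPROFILE at an output prefix, and inspect the dumps with pprof.
$ HEAPPROFILE=/tmp/wasm-opt.hprof \
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  ./bin/wasm-opt -O3 input.wasm -o output.wasm
$ ls /tmp/wasm-opt.hprof.*                                   # one or more .heap dumps are written during the run
$ google-pprof --text ./bin/wasm-opt /tmp/wasm-opt.hprof.0001.heap   # top allocation sites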
Linking with mimalloc to replace the libc built-in allocator only requires special link-time configuration, and doesn't require changing C/C++ source code at all. When targeting wasm, you don't need to do anything special, just live with the original libc allocator. https://github.com/rui314/mold/blob/main/CMakeLists.txt#L138 is a good example of properly linking against mimalloc. Though it's even possible to not modify the cmake config at all, just specify |
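(For a quick evaluation before touching any build configuration at all, runtime interposition also works. This is just a sketch with an assumed library path and input file, not the link-time setup described above:)
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./bin/wasm-opt -O3 big.wasm -o big.opt.wasm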
If I am not mistaken, mimalloc supports WebAssembly, but only with WASI |
You don't need mimalloc at all when targeting wasm. |
I think it could be useful in multithreaded wasm builds? Just like for native ones. |
That's correct, although the mimalloc codebase currently really just supports single-threaded wasm32-wasi. |
I see, makes sense. I'd be ok with just enabling mimalloc for non-wasm for now then, if we want to go that route. |
If anyone wants to give the mimalloc flavour a try, I've created a statically-linked x86_64-linux binary release of version_112 at https://nightly.link/type-dance/binaryen/actions/artifacts/593257094.zip. The build script is available at https://github.com/type-dance/binaryen/blob/main/build-alpine.sh. |
Note: This issue is relevant for #4165. @tlively I looked into tcmalloc to profile our mallocs. I found some possible improvements and will open PRs, but I'm not sure how big an impact they will have. One issue is that tcmalloc measures the size of allocations, not the number of malloc calls, and we might have very many small allocations or quickly-freed ones that don't show up in that type of profiling. |
I just found |
Interesting! It says this:
So I tried both with the normal system malloc (which it seems it may not be able to profile) and with a userspace malloc (tcmalloc). The results did not change much. I guess that is consistent with malloc contention not actually being an issue on some machines, perhaps because their mallocs have low contention (like the default Linux one on my machine, and tcmalloc). So to really dig into malloc performance we'd need to run mutrace on a system that sees the slowdown. @TerrorJack perhaps you can try that? But the results on my machine are interesting with regard to non-malloc mutexes. There is one top mutex by far:
That mutex 10 is
Which is the Type mutex used in (the one after it is the thread pool mutex, |
(that is a profile on running |
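(For reference, mutrace is used as a wrapper around the profiled command and prints a per-mutex contention table when the process exits; a sketch, with an assumed input file:)
$ mutrace ./bin/wasm-opt -O3 input.wasm -o output.wasm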
Makes sense; this confirms a suspicion I had that the global type cache is extremely contended. We frequently do things like |
As another data point, I also ran with plain dlmalloc, which doesn't have any complex per-thread pools AFAIK. But the results are the same as with my system allocator and tcmalloc. So somehow I just don't see malloc contention on my machine... |
Just another data point to consider. I did some crude timing tests of wasm-opt using LD_PRELOAD, measuring the mimalloc vs jemalloc vs glibc allocators. The pre-optimized wasm file is ~117MB
Here is the output running with BINARYEN_PASS_DEBUG=1 |
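A sketch of that kind of comparison (the allocator library paths are assumptions; /usr/bin/time -v also reports the maximum RSS, which is relevant for the memory numbers discussed later):
$ /usr/bin/time -v ./bin/wasm-opt -O3 big.wasm -o big.opt.wasm                                                # glibc malloc
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 /usr/bin/time -v ./bin/wasm-opt -O3 big.wasm -o big.opt.wasm
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 /usr/bin/time -v ./bin/wasm-opt -O3 big.wasm -o big.opt.wasm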
@arsnyder16 Thanks for conducting the experiment. Have you actually confirmed that mimalloc is loaded? For example:
$ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./wasm-opt --version
wasm-opt version 112 (version_112) |
Seems to be working fine for me
|
This saves the work of freeing and allocating for all the other maps. This is a code path that is used by several passes, so it showed up in profiling for #5561
This makes the pass 2-3% faster in some measurements I did locally. Noticed when profiling for WebAssembly#5561 (comment). Helps WebAssembly#4165
This saves the work of freeing and allocating for all the other maps. This is a code path that is used by several passes, so it showed up in profiling for WebAssembly#5561
I stumbled over this issue when compiling/optimizing a Kotlin/Wasm application, specifically the benchmarks in https://github.com/JetBrains/compose-multiplatform/tree/ok/benchmarks_d8/benchmarks/multiplatform. To easily reproduce this, simply download the attached input WasmGC binary and measure via wasm-opt:
See the very high system time. Looking at it with wasm-opt
I assume the changes linked above reduced the number of allocations drastically, hence removing most of the allocator lock contention. However, this can be sped up by another 30% (!) by using mimalloc as an allocator. Specifically:
This gives an even better runtime (at the cost of higher max RSS):
|
Forgot the binary for the repro, attached: compose-benchmarks-benchmarks-wasm-js.zip |
I also tried to briefly check if there are more allocations one could get rid of, both by regular profiling and taking a heap profile from a
gives this profile (sorry, Googlers-only): https://pprof.corp.google.com/?id=4b7bcac9402431aa89920010f8724f15 I am not a binaryen expert, but I also quickly took a heap profile with tcmalloc from gperftools:
which gives this heap profile: https://pprof.corp.google.com/?id=862bc2615a922858148915a281818d11&metric=alloc_objects&tab=bottomup Again, I am not a binaryen expert, but there are a couple of heavily allocating or reallocating functions:
|
The big speedup since version 119 is likely due to the large amount of work we did a few months ago, mentioned in #4165
All those rebuild the type graph when they find things to optimize. The allocations are interesting, thanks for that analysis. Those are definitely worth looking into. |
Ha, I've found out why using mimalloc makes such a big difference here. The release executables are built via binaryen/.github/workflows/create_release.yml (line 106 in 9b161a9):
This uses musl libc (since glibc cannot be statically linked), and musl's allocator suffers from heavy lock contention in multi-threaded workloads, see e.g., https://nickb.dev/blog/default-musl-allocator-considered-harmful-to-performance/ In other words, all users that don't build binaryen from source but use the Linux binaries from https://github.com/WebAssembly/binaryen/releases will suffer from this allocator, especially on machines with many cores. The solution is probably to switch to dynamic linking, which will then use glibc; or, if a static binary is required, to use a better allocator (mimalloc supports static linking, see https://github.com/microsoft/mimalloc?tab=readme-ov-file#using-the-library) |
Thanks @danleh ! I believe we build on Alpine in order to get a single static build that can run on as many Linux distros as possible. Getting a dynamic build to run that way is harder IIUC, but I am not an expert on this. @dschuff @sbc100 Do you know what we should be doing here? And how do we build the emsdk Binaryen - is that also linked statically with malloc/free? (If so perhaps it is slow as well, or if not, perhaps we can do the same here as we do there?) |
Which fixes performance problems, especially on heavily multi-threaded workloads (many functions, high core count machines), see WebAssembly#5561.
Right, makes sense. If adding mimalloc to Build instructions:
gives (i.e., uses mimalloc)
and is statically linked
|
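For reference, a minimal sketch of how one could verify both properties on the resulting binary (the binary path is an assumption):
$ MIMALLOC_VERBOSE=1 ./bin/wasm-opt --version    # mimalloc prints its verbose banner when it is the active allocator
$ ldd ./bin/wasm-opt                             # reports "not a dynamic executable" for a fully static binary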
If the goal is just to link mimalloc from a dynamic library (some prebuilt distribution?) instead of linking it statically from Alpine's musl, I think that should be possible. It would of course also be possible to link it statically (via a prebuilt archive file, or by building it ourselves). The emsdk binaryen is dynamically linked against glibc, so interposition is possible. We use Chromium's infrastructure to build that, and Chromium's solution for good compatibility is to build in a sysroot which has a pretty old version of glibc, and while not as maximally-compatible as statically linking libc, it works in practice on most systems. I don't recall hearing any requests to work on older systems. Having said all that, I think the following are what I have historically thought we should focus on:
@danleh's experience suggests that I could be wrong about 2 (obviously he's using our prebuilt, but not Alpine). It seems OP is in the "DIY build" category. Having said all that, including mimalloc in third_party as a submodule (or just a snapshot), and adding a CMake option seems like it would be pretty easy, and maybe we should just do that, if there's a good performance win compared to glibc and/or the default mac/windows allocators. |
Ah great, this sounds good to me too. Maybe we can do performance tests (or solicit them from users) on Mac and Windows too, and figure out where exactly we should enable it by default. |
Funnily enough, we already do that for the wasm build (lines 321 to 323 in 9b161a9):
We did this after https://web.dev/articles/scaling-multithreaded-webassembly-applications#mimalloc. So maybe the wasm build has been faster than the native one... I didn't imagine a native build could have similar issues. Good find, @danleh! |
Let's put concrete comments/suggestions on how to integrate and link mimalloc in the PR #7378. Regarding more general discussion and questions:
As @tlively said, I don't think that's true (anymore). At least Kotlin (via Gradle) uses your tagged releases, and I am pretty sure average users often go the "path of least resistance" and just take your binaries (instead of building from source). (CC @eymar from Kotlin/JetBrains.)
That's pretty cool and at least in principle I would prefer dynamic linking of an old glibc over static linking of musl + mimalloc, for three reasons: (i) code size (even before mimalloc, the static On the other hand, pulling in Chromium's infrastructure only for more portable binaries seems like total overkill, and you probably had good reasons to go the Alpine Linux / static linking route in the past. So I am fine with just statically linking mimalloc for fixing this particular performance issue.
I don't have numbers on Windows or Mac OS, but I think their allocator performance is much better than musl libc's in multi-threaded workloads, so the relative benefit of mimalloc over them is much smaller there. In other words: I don't think there is urgency for Windows or Mac, and I enabled mimalloc only for the static Linux build for now. |
And just to clarify explicitly, I had originally mixed up these points:
Concrete numbers from my 128-core x64 machine, optimizing a 12 MB WasmGC binary (see #5561 (comment)):
(Building C++ with a statically linked musl libc is a pain on Ubuntu, hence I only have numbers from the prebuilt binaries and dynamically linking glibc.) |
1.5x faster runtime but also 1.5x more RSS. The "trade memory for speed and vice versa" rule in action. |
@MaxGraey Compare lines 2 and 4 (122 with musl's allocator, 122 with some other allocator). The difference is almost 10x - I guess musl's allocator is just not that good at heavily multithreaded allocations. I do agree 50% more memory is a downside though. It seems worth it in the official Linux binaries for a 9x speedup, but for dynamic builds where the speedup is just 1.5x, maybe not. |
I wonder if it would be better to use musl with static linkage + a statically linked mimalloc (which should override musl's allocator). In theory it should be a little faster than dynamic mimalloc & glibc and should definitely reduce RSS. |
i) Is 18MB a lot? For the emscripten release builds we statically link libc++ and libbinaryen; wasm-opt is 19MB, clang is 131MB, and the whole package is 300MB compressed and 1.2G uncompressed. We don't get too many (?) complaints about that, so I'm not sure I'd worry about code size too much. I don't think we'd need to bring all of Chromium's infrastructure in here. We could probably get away with just pulling in the sysroot. Or maybe there's an older Linux image available on GitHub than the one we are using. |
Numbers with glibc + mimalloc statically linked:
In other words, at least for this workload and configuration (glibc, mimalloc), static linking doesn't make a difference (compare with row 6 above). As I said, building C++ code with musl on Ubuntu is a bit tedious (you have to build libstdc++ yourself from sources), so I don't have numbers for it.
Thanks for these ballpark numbers. Makes sense; binary size is not really a concern here, then. |
The new CMake flag MIMALLOC_STATIC controls this. mimalloc fixes perf problems, especially on heavily multi-threaded workloads (many functions, high core count machines) on the musl allocator, see #5561.
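Presumably the flag is set at configure time like any other CMake option; a sketch (the option value and build directory are assumptions):
$ cmake -S . -B out -DCMAKE_BUILD_TYPE=Release -DMIMALLOC_STATIC=ON
$ cmake --build out -j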
Fixed by #7378, which uses mimalloc on Linux static builds (including our official ones), the only place where we are aware of issues atm. |
Just for completeness, I'll mention that I actually experimented with using the Chromium sysroot instead of Alpine for the release build, but I forgot that the sysroot doesn't have a new enough version of libstdc++ for Binaryen. Chrome (and our emsdk build) solves this by building its own version of libc++ to go along with the app, and we could certainly do that here too. It's just more of a pain than what we have here. |
I've seen a huge performance improvement when wasm-opt is linked with mimalloc and optimizes a big wasm module on a many-cores machine!
Result sans mimalloc:
Result with mimalloc:
The sans-mimalloc wasm-opt is the official version_112 release, while the with-mimalloc case is compiled from the same version, but linked with mimalloc v2.0.9. The test command is wasm-opt -Oz hello.wasm -o hello.opt.wasm; hello.wasm is 26MB. Both runs were done in an ubuntu:22.10 container on a server with an AMD EPYC 7401P (48 logical cores) and 128GB memory.