Consider linking with mimalloc in release executables? #5561


Closed
TerrorJack opened this issue Mar 10, 2023 · 38 comments


@TerrorJack
Contributor

I've seen a huge performance improvement when wasm-opt is linked with mimalloc and optimizes a big wasm module on a many-core machine!

Result sans mimalloc:

$ time bench ./test.sh
benchmarking ./test.sh
time                 221.9 s    (218.3 s .. 224.6 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 219.3 s    (218.1 s .. 220.3 s)
std dev              1.411 s    (1.155 s .. 1.566 s)
variance introduced by outliers: 19% (moderately inflated)


real    58m35.860s
user    129m50.133s
sys     2395m41.639s

Result with mimalloc:

$ time bench ./test.sh 
benchmarking ./test.sh
time                 14.06 s    (13.38 s .. 14.86 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 13.94 s    (13.82 s .. 14.03 s)
std dev              123.8 ms   (76.09 ms .. 151.4 ms)
variance introduced by outliers: 19% (moderately inflated)


real    3m43.584s
user    45m38.783s
sys     0m40.349s
  • The sans-mimalloc case uses the official Linux x64 binaries of version_112, while the with-mimalloc case is compiled from the same version but linked with mimalloc v2.0.9
  • The command is wasm-opt -Oz hello.wasm -o hello.opt.wasm; hello.wasm is 26MB
  • The benchmark is run with bench, which runs the same command multiple times and outputs the statistics above.
  • The test is conducted in a ubuntu:22.10 container on a server with AMD EPYC 7401P (48 logical cores) and 128GB memory.
@kripken
Member

kripken commented Mar 10, 2023

Very interesting!

Overall this makes me think that maybe the issues we've seen with multithreading overhead are due to malloc contention between threads, like these: emscripten-core/emscripten#15727, #2740

It might be good to investigate two things here:

  • How easy mimalloc integration is (if it's a single file, and has a wasm port - which we need for binaryen.js - that would be ideal).
  • Whether we can reduce our malloc contention. We allocate Expression objects very efficiently in arenas, so this must be just other random small allocations that we do all over the place... if so then using more SmallSet/SmallVector might help (see the sketch after this list). Perhaps there is a tool that can find which stack traces lead to most of these contending allocations.
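
For illustration, here is a minimal sketch of the small-buffer idea behind SmallSet/SmallVector (Binaryen's real implementations differ in detail; the names and layout here are hypothetical):

#include <array>
#include <cstddef>
#include <vector>

// The small-buffer idea: keep the first N elements in inline storage so the
// common case never calls malloc, and spill to the heap only for the rare
// oversized case. This is the concept, not Binaryen's actual implementation.
template <typename T, size_t N>
struct TinySmallVector {
  std::array<T, N> inlineStorage; // no allocation for the common case
  size_t used = 0;
  std::vector<T> overflow;        // heap storage, touched only when full

  void push_back(const T& x) {
    if (used < N) {
      inlineStorage[used++] = x;  // fast path: no malloc, no allocator lock
    } else {
      overflow.push_back(x);      // slow path: ordinary heap allocation
    }
  }
  size_t size() const { return used + overflow.size(); }
};

Since hot paths then never touch malloc, they also never contend on an allocator lock, which is exactly the overhead suspected here.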

@tlively
Member

tlively commented Mar 10, 2023

Looks like Google publishes a heap profiler that might be useful for this: https://gperftools.github.io/gperftools/heapprofile.html

@TerrorJack
Contributor Author

Linking with mimalloc to replace the libc built-in allocator only requires special link-time configuration, and doesn't require changing C/C++ source code at all. When targeting wasm, you don't need to do anything special; just live with the original libc allocator.

https://github.com/rui314/mold/blob/main/CMakeLists.txt#L138 is a good example of properly linking against mimalloc. Though it's even possible to not modify the CMake config at all and just specify -DCMAKE_EXE_LINKER_FLAGS="-Wl,--push-state,--whole-archive,path/to/libmimalloc.a,--pop-state" for Linux or -DCMAKE_EXE_LINKER_FLAGS="-Wl,-force_load,path/to/libmimalloc.a" for macOS at configure time.

@MaxGraey
Contributor

When targeting wasm, you don't need to do anything special; just live with the original libc allocator.

If I am not mistaken, mimalloc supports WebAssembly, but only with WASI

@TerrorJack
Contributor Author

If I am not mistaken, mimalloc supports WebAssembly, but only with WASI

You don't need mimalloc at all when targeting wasm.

@kripken
Member

kripken commented Mar 10, 2023

You don't need mimalloc at all when targeting wasm

I think it could be useful in multithreaded wasm builds? Just like for native ones.

@TerrorJack
Contributor Author

I think it could be useful in multithreaded wasm builds? Just like for native ones.

That's correct, although the mimalloc codebase currently only supports single-threaded wasm32-wasi.

@kripken
Member

kripken commented Mar 10, 2023

I see, makes sense.

I'd be ok with just enabling mimalloc for non-wasm for now then, if we want to go that route.

@TerrorJack
Contributor Author

If anyone wants to give the mimalloc flavour a try, I've created a statically-linked x86_64-linux binary release of version_112 at https://nightly.link/type-dance/binaryen/actions/artifacts/593257094.zip. The build script is available at https://github.com/type-dance/binaryen/blob/main/build-alpine.sh.

@kripken
Member

kripken commented Mar 15, 2023

Note: This issue is relevant for #4165

@tlively I looked into tcmalloc to profile our mallocs. I found some possible improvements and will open PRs, but I'm not sure how big an impact they will have. An issue is that tcmalloc measures the size of allocations, not the number of malloc calls, and we might have very many small allocations or quickly-freed ones that don't show up in that type of profiling.
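
Since tcmalloc's heap profile reports bytes, one cheap way to see call counts instead (a hypothetical shim, not something Binaryen ships) is to override the global operator new/delete in a debug build and tally invocations:

#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <new>

// Counting shim: overriding the global operator new/delete counts allocation
// *calls*, complementing profilers that report bytes rather than call counts.
static std::atomic<unsigned long long> newCalls{0};

void* operator new(std::size_t size) {
  newCalls.fetch_add(1, std::memory_order_relaxed);
  if (void* p = std::malloc(size)) {
    return p;
  }
  throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Call this at exit (e.g. via atexit) to see how many allocations were made.
void reportAllocCalls() {
  std::printf("operator new called %llu times\n", newCalls.load());
}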

@tlively
Member

tlively commented Mar 15, 2023

I just found mutrace for profiling lock contention specifically. I'd be very interested to see the results here!

http://0pointer.de/blog/projects/mutrace.html

@kripken
Member

kripken commented Mar 15, 2023

Interesting!

It says this:

Due to the way mutrace works we cannot profile mutexes that are used internally in glibc, such as those used for synchronizing stdio and suchlike.

So I tried both with the normal system malloc (which it seems it may not be able to profile) and with a userspace malloc (tcmalloc). The results did not change much. I guess that is consistent with malloc contention not actually being an issue on some machines, perhaps because their mallocs have low contention (like the default Linux one on my machine, and tcmalloc).

So to really dig into malloc performance we'd need to run mutrace on a system that sees the slowdown. @TerrorJack perhaps you can try that?

But the results on my machine are interesting about non-malloc mutexes. There is one top mutex by far:

mutrace: Showing 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
      10 41885864 16900980 10783056     7334.222        0.000       36.992 M-.--.
      16     2470     2286      713      114.601        0.046       24.002 M-.--.
       4      734      366        0    68689.275       93.582    19386.295 M-.--.

That mutex #10 is

Mutex #10 (0x0x7f54728ffca0) first referenced by:
	libmutrace.so(pthread_mutex_lock+0x46) [0x7f54729c1576]
	libbinaryen.so(+0x9b61d7) [0x7f54723b61d7]
	libbinaryen.so(_ZN4wasm4TypeC1ENS_8HeapTypeENS_11NullabilityE+0x41) [0x7f54723b6ae1]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder12getBasicTypeEiRNS_4TypeE+0x113) [0x7f547233aed3]
	libbinaryen.so(+0x93d529) [0x7f547233d529]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder9readTypesEv+0x4dd) [0x7f5472340f9d]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder4readEv+0x748) [0x7f547235fe48]
	libbinaryen.so(_ZN4wasm12ModuleReader14readBinaryDataERSt6vectorIcSaIcEERNS_6ModuleENSt7__cxx1112basic_stringIcSt11char_traitsIcES2_EE+0x5c) [0x7f54723732cc]
	libbinaryen.so(_ZN4wasm12ModuleReader10readBinaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_6ModuleES6_+0x73) [0x7f5472373503]
	libbinaryen.so(_ZN4wasm12ModuleReader4readENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_6ModuleES6_+0x17a) [0x7f5472373e1a]
	wasm-opt(+0x29a06) [0x5644f215fa06]
	libc.so.6(+0x2718a) [0x7f547144618a]

That is the Type mutex used in wasm::Type::Type(wasm::HeapType, wasm::Nullability). Perhaps that can be improved, @tlively?

(the one after it is the thread pool mutex, wasm::ThreadPool::initialize(unsigned long), which I doubt we can improve, but also it's orders of magnitude less frequent)

@kripken
Member

kripken commented Mar 15, 2023

(that is a profile on running wasm-opt -g -all --closed-world -tnh -O3 --type-ssa --gufa -O3 --type-merging on a large Dart testcase of Wasm GC, so it does stress type optimizations I guess)

@tlively
Member

tlively commented Mar 15, 2023

Makes sense, this confirms a suspicion I had that the global type cache is extremely contended. We frequently do things like type == Type(heapType, Nullable) or Type::isSubType(type, Type(heapType, Nullable)), and the creation of those temporary Type objects requires taking the lock. I'll take an action item to try to purge these patterns from the code base.
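
As a minimal sketch of the fix (with a toy type standing in for the real wasm::Type, whose interning machinery differs), hoisting the temporary out of the loop takes the cache lock once rather than once per element:

#include <mutex>
#include <vector>

// Toy model of the pattern described above, not Binaryen's real Type:
// constructing a Type consults a global cache under a mutex, so building
// temporaries like Type(heapType, Nullable) inside a hot loop serializes
// all threads on that one lock.
namespace toy {
std::mutex typeCacheMutex;
struct HeapType { int id; };
enum Nullability { NonNullable, Nullable };
struct Type {
  int id;
  Type(HeapType ht, Nullability n) {
    std::lock_guard<std::mutex> lock(typeCacheMutex); // the global cache lock
    id = ht.id * 2 + (n == Nullable ? 1 : 0);         // stand-in for interning
  }
  bool operator==(const Type& other) const { return id == other.id; }
};
} // namespace toy

int countNullable(const std::vector<toy::Type>& types, toy::HeapType ht) {
  toy::Type nullable(ht, toy::Nullable); // lock taken once, outside the loop
  int count = 0;
  for (const auto& t : types) {
    if (t == nullable) { // plain comparison, no lock
      ++count;
    }
  }
  return count;
}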

@kripken
Member

kripken commented Mar 15, 2023

As another datapoint, I also ran with plain dlmalloc, which doesn't have any complex per-thread pools AFAIK. But the results are the same as with my system allocator and tcmalloc. So somehow I just don't see malloc contention on my machine...

@arsnyder16
Contributor

arsnyder16 commented Mar 16, 2023

Just another data point to consider. I did some crude timing tests of wasm-opt using LD_PRELOAD, measuring the mimalloc vs jemalloc vs glibc allocators.

The pre-optimized wasm file is ~117MB.

mimalloc

real    1m28.478s
user    13m48.665s
sys     0m1.572s

jemalloc

real    1m0.543s
user    7m59.951s
sys     0m0.931s

glibc

real    1m25.956s
user    9m40.555s
sys     0m1.791s

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 
echo "mimalloc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf-mimalloc.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 
echo "jemalloc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf-jemalloc.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext
unset LD_PRELOAD
echo "glibc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext

Here is the output from running with BINARYEN_PASS_DEBUG=1:
output.txt

@TerrorJack
Contributor Author

@arsnyder16 Thanks for conducting the experiment. Have you actually confirmed mimalloc is used at runtime by setting MIMALLOC_VERBOSE=1? Dynamic override via LD_PRELOAD doesn't seem to work at all for some reason:

$ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./wasm-opt --version
wasm-opt version 112 (version_112)

@arsnyder16
Contributor

Seems to be working fine for me:

# MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 /root/emsdk/upstream/bin/wasm-opt --version
mimalloc: option 'show_errors': 1
mimalloc: option 'show_stats': 0
mimalloc: option 'eager_commit': 1
mimalloc: option 'deprecated_eager_region_commit': 0
mimalloc: option 'deprecated_reset_decommits': 0
mimalloc: option 'large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 0
mimalloc: option 'segment_cache': 0
mimalloc: option 'page_reset': 0
mimalloc: option 'abandoned_page_decommit': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'decommit_delay': 25
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 0
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'allow_decommit': 1
mimalloc: option 'segment_decommit_delay': 500
mimalloc: option 'decommit_extend_delay': 2
mimalloc: process init: 0x7f1a07b41f40
mimalloc: debug level : 2
mimalloc: secure level: 0
mimalloc: using 1 numa regions
wasm-opt version 112 (version_112-45-g9dcdd47a2)
heap stats:    peak      total      freed    current       unit      count
normal   1:    552 B      552 B      552 B        0          8 B       69      ok
normal   4:    4.7 KiB    4.8 KiB    4.8 KiB     32 B       32 B      155      not all freed!
normal   6:   35.6 KiB   48.2 KiB   37.5 KiB   10.7 KiB     48 B      1.0 K    not all freed!
normal   8:    9.4 KiB   22.6 KiB   16.8 KiB    5.7 KiB     64 B      361      not all freed!
normal   9:    7.2 KiB   14.2 KiB   10.4 KiB    3.8 KiB     80 B      182      not all freed!
normal  10:    2.7 KiB    4.7 KiB    3.2 KiB    1.5 KiB     96 B       50      not all freed!
normal  11:    1.8 KiB    3.9 KiB    2.7 KiB    1.2 KiB    112 B       36      not all freed!
normal  12:   18.6 KiB   19.5 KiB   18.5 KiB    1.0 KiB    128 B      156      not all freed!
normal  13:    1.5 KiB    3.1 KiB    2.1 KiB    960 B      160 B       20      not all freed!
normal  14:    768 B      1.3 KiB    960 B      384 B      192 B        7      not all freed!
normal  15:    672 B      1.7 KiB    1.3 KiB    448 B      224 B        8      not all freed!
normal  16:    512 B      512 B      256 B      256 B      256 B        2      not all freed!
normal  17:    320 B      320 B      320 B        0        320 B        1      ok
normal  18:    768 B      768 B      768 B        0        384 B        2      ok
normal  19:    448 B      896 B      896 B        0        448 B        2      ok
normal  21:    640 B      640 B      640 B        0        640 B        1      ok
normal  23:    1.7 KiB    3.5 KiB    3.5 KiB      0        896 B        4      ok
normal  25:    1.2 KiB    2.5 KiB    1.2 KiB    1.2 KiB    1.2 KiB      2      not all freed!
normal  27:    3.5 KiB    7.0 KiB    7.0 KiB      0        1.7 KiB      4      ok
normal  29:    2.5 KiB    2.5 KiB    2.5 KiB      0        2.5 KiB      1      ok
normal  31:   10.5 KiB   14.0 KiB   14.0 KiB      0        3.5 KiB      4      ok
normal  33:    5.0 KiB    5.0 KiB    5.0 KiB      0        5.0 KiB      1      ok
normal  35:   14.0 KiB   14.0 KiB   14.0 KiB      0        7.0 KiB      2      ok
normal  37:   10.0 KiB   10.0 KiB   10.0 KiB      0       10.0 KiB      1      ok
normal  41:   20.0 KiB   20.0 KiB   20.0 KiB      0       20.0 KiB      1      ok
normal  45:   40.1 KiB   40.1 KiB      0       40.1 KiB   40.1 KiB      1      not all freed!

heap stats:    peak      total      freed    current       unit      count
    normal:  142.7 Ki   231.1 Ki   166.9 Ki    64.2 Ki     112 B      2.1 K    not all freed!
     large:      0          0          0          0                            ok
      huge:      0          0          0          0                            ok
     total:  142.7 KiB  231.1 KiB  166.9 KiB   64.2 KiB                        not all freed!
malloc req:  128.6 KiB  206.6 KiB  147.9 KiB   58.6 KiB                        not all freed!

  reserved:   64.0 MiB   64.0 MiB      0       64.0 MiB                        not all freed!
 committed:   64.0 MiB   64.0 MiB      0       64.0 MiB                        not all freed!
     reset:      0          0          0          0                            ok
   touched:  357.5 KiB  379.8 KiB   99.5 KiB  280.3 KiB                        not all freed!
  segments:      1          1          0          1                            not all freed!
-abandoned:      0          0          0          0                            ok
   -cached:      0          0          0          0                            ok
     pages:     23         29         16         13                            not all freed!
-abandoned:      0          0          0          0                            ok
 -extended:     48
 -noretire:     22
     mmaps:      1
   commits:      0
   threads:      0          0          0          0                            ok
  searches:     0.3 avg
numa nodes:       1
   elapsed:       0.002 s
   process: user: 0.002 s, system: 0.000 s, faults: 0, rss: 9.0 MiB, commit: 64.0 MiB
mimalloc: process done: 0x7f1a07b41f40

kripken added a commit that referenced this issue Mar 21, 2023
This makes the pass 2-3% faster in some measurements I did locally.

Noticed when profiling for #5561 (comment)

Helps #4165
kripken added a commit that referenced this issue Apr 5, 2023
This saves the work of freeing and allocating for all the other maps. This is a
code path that is used by several passes so it showed up in profiling for
#5561
@danleh
Contributor

danleh commented Mar 11, 2025

I stumbled over this issue when compiling/optimizing a Kotlin/Wasm application, specifically the benchmarks in https://github.com/JetBrains/compose-multiplatform/tree/ok/benchmarks_d8/benchmarks/multiplatform via ./gradlew clean :benchmarks:wasmJsProductionExecutableCompileSync. Changing from my Linux system allocator to mimalloc improves wasm-opt runtime by >30%. More details, repro instructions, and allocation profiles below. CC @eymar

To easily reproduce this, simply download the attached input WasmGC binary and measure via /usr/bin/time -v wasm-opt --enable-gc --enable-reference-types --enable-exception-handling --enable-bulk-memory --enable-nontrapping-float-to-int --closed-world --inline-functions-with-loops --traps-never-happen --fast-math --type-ssa -O3 -O3 --gufa -O3 --type-merging -O3 -Oz input/compose-benchmarks-benchmarks-wasm-js.wasm -o output.wasm (the flags are what Kotlin/Wasm uses).

wasm-opt version_119 takes >8 minutes (output of /usr/bin/time -v wasm-opt ...) on a high-core-count x64 machine:

User time (seconds): 961.47
System time (seconds): 52019.90
Percent of CPU this job got: 10831%
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:09.15
[...]
Maximum resident set size (kbytes): 841232

See the very high system time. Looking at it with perf top -g shows this is because of lock contention in the allocator.

wasm-opt version_122-69-g0472ba2cf (I compiled locally from the current main branch) is already a vast improvement:

User time (seconds): 238.92
System time (seconds): 9.31
Percent of CPU this job got: 528%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.98
[...]
Maximum resident set size (kbytes): 753204

I assume the changes linked above reduced the number of allocations drastically, hence removing most of the allocator lock contention.

However, this can be sped up by another 30% (!) by using mimalloc as an allocator. Specifically:

sudo apt install libmimalloc2.0
# Make sure it's taken up by wasm-opt via LD_PRELOAD:
MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 wasm-opt --version
# Should print something like:
mimalloc: process init: 0x7F5DA235E400
[...]
# Then run with:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 /usr/bin/time -v wasm-opt ... (rest of the arguments from above)

This gives an even better runtime (at the cost of higher max RSS):

User time (seconds): 213.09
System time (seconds): 6.26
Percent of CPU this job got: 714%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:30.68
[...]
Maximum resident set size (kbytes): 1157180

@danleh
Contributor

danleh commented Mar 11, 2025

Forgot the binary for the repro, attached: compose-benchmarks-benchmarks-wasm-js.zip

@danleh
Contributor

danleh commented Mar 11, 2025

I also tried to briefly check if there are more allocations one could get rid of, both by regular profiling and taking a heap profile from a RelWithDebInfo build:

cd path/to/binaryen/
cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo && make
cd path/to/repro
perf record -g -k1 --freq=1000 path/to/wasm-opt ... && pprof -flame perf.data

gives this profile (sorry, Googlers-only): https://pprof.corp.google.com/?id=4b7bcac9402431aa89920010f8724f15

I am not a binaryen expert, but wasm::GlobalTypeRewriter::rebuildTypes and wasm::Walker::walk come up quite a bit as parents of malloc and free (see the Bottom-Up profile). Is that expected?

I also quickly took a heap profile with tcmalloc from gperftools:

LD_PRELOAD=~/gperftools/.libs/libtcmalloc_and_profiler.so MALLOCSTATS=1 HEAP_PROFILE_ALLOCATION_INTERVAL=100000000 HEAPPROFILE=tcmalloc path/to/wasm-opt ...
# This produces several heapdumps, after each 100MB allocated. You can then inspect one of the later ones with:
pprof -flame tcmalloc.0226.heap

which gives this heap profile: https://pprof.corp.google.com/?id=862bc2615a922858148915a281818d11&metric=alloc_objects&tab=bottomup

Again, I am not a binaryen expert, but there are a couple of heavily allocating or reallocating functions:

  • wasm::Type::getHeapTypeChildren is responsible for >7% of all allocations via std::__new_allocator::allocate (expand the latter in the bottom-up profile recursively, until you find getHeapTypeChildren)
  • wasm::Type::getHeapTypeChildren is responsible for >8% of all allocations via _M_realloc_append (same story: expand the latter in the bottom-up profile until you see it).
  • In total getHeapTypeChildren seems to be responsible for >25% of all allocations, see https://pprof.corp.google.com/?id=862bc2615a922858148915a281818d11&metric=alloc_objects&filter=focus:getHeapTypeChildren&tab=flame
  • wasm::PassUtils::FilteredPass::create is responsible for >6% of all allocations via std::make_unique.
  • wasm::PostWalker::scan or wasm::Walker::pushTask (not sure, one is partially inlined) seem to contain a wasm::SmallVector that becomes too large and then is placed on the heap (>4% of all allocations). Maybe one can just increase the inline size a bit to avoid more heap allocations?
  • 4% of all allocations are from hash table inserts in wasm::InsertOrderedMap::insert. Maybe one could replace the std::unordered_map here with a C++23 std::flat_map or absl::flat_hash_map and pre-size it if possible?
  • 4% of all allocations are from std::set::insert in wasm::EffectAnalyzer::InternalAnalyzer::visitLocalGet (I think here). Maybe use a flat_set here as well (C++23 or Abseil)? (A sketch of both ideas follows this list.)
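
A rough sketch of those last two ideas using only standard containers (the C++23/Abseil flat containers mentioned above would look similar); the names are illustrative, not Binaryen's:

#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// 1. Pre-sizing: reserving buckets up front replaces the repeated
//    rehash-and-reallocate churn that shows up as many small allocations
//    in the heap profile with a single allocation.
std::unordered_map<int, int> makePresizedMap(std::size_t expected) {
  std::unordered_map<int, int> map;
  map.reserve(expected);
  return map;
}

// 2. Flat set: a sorted vector allocates once per capacity growth instead
//    of once per inserted node (as std::set does), a good trade for small,
//    read-heavy sets.
struct FlatIntSet {
  std::vector<int> items; // kept sorted and unique

  void insert(int x) {
    auto it = std::lower_bound(items.begin(), items.end(), x);
    if (it == items.end() || *it != x) {
      items.insert(it, x);
    }
  }
  bool contains(int x) const {
    return std::binary_search(items.begin(), items.end(), x);
  }
};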

@kripken
Member

kripken commented Mar 11, 2025

The big speedup since 119 is likely the large amount of work we did a few months ago, mentioned in #4165

::walk etc. is expected - every pass does a walk over the IR, and that is the core method. There are at least no obvious optimizations left there... rebuildTypes is more surprising, but our core Type and HeapType classes are not arena-allocated (and can't easily be), so a module with lots of types and lots of type optimization opportunities can end up working a lot there, I guess. Indeed, looking at passes at a high level with BINARYEN_PASS_DEBUG=1, the slow ones include

signature-pruning
signature-refining
gto
type-refining
abstract-type-refining

All those rebuild the type graph when they find things to optimize.

The allocations are interesting, thanks for that analysis. Those are definitely worth looking into.

@danleh
Contributor

danleh commented Mar 17, 2025

@arsnyder16 Thanks for conducting the experiment. Have you actually confirmed mimalloc is used at runtime by setting MIMALLOC_VERBOSE=1? Dynamic override via LD_PRELOAD doesn't seem to work at all for some reason:

$ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./wasm-opt --version
wasm-opt version 112 (version_112)

Ha, I've found out why using mimalloc via LD_PRELOAD didn't work for you! The official Linux binaries of Binaryen (including wasm-opt) are built via the Alpine-based release workflow, and that produces a static binary. You can confirm that via:

$ ldd binaryen-version_122/bin/wasm-opt
        not a dynamic executable

This uses musl libc (since glibc cannot be statically linked), and musl's allocator suffers from heavy lock contention in multi-threaded workloads; see e.g. https://nickb.dev/blog/default-musl-allocator-considered-harmful-to-performance/

In other words, all users that don't build binaryen from source but use the Linux binaries from https://github.com/WebAssembly/binaryen/releases will suffer from this allocator, especially on machines with many cores.

The solution is probably to switch to dynamic linking, which will then use glibc; or, if a static binary is required, to use a better allocator (mimalloc supports static linking, see https://github.com/microsoft/mimalloc?tab=readme-ov-file#using-the-library)
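
As an aside, the contention is easy to reproduce in isolation with a micro-benchmark along these lines (hypothetical, not from this thread): many threads making small, short-lived allocations, which is roughly the pattern of Binaryen's parallel passes. Build with g++ -O2 -pthread and run under each allocator via LD_PRELOAD to compare wall times:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Spawn one worker per hardware thread, each doing many small, short-lived
// allocations, and report the total wall time. Under a contended allocator
// (e.g. musl's) this slows down sharply as the thread count grows.
int main() {
  const unsigned numThreads = std::thread::hardware_concurrency();
  const int allocsPerThread = 1'000'000;

  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < numThreads; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < allocsPerThread; ++i) {
        void* p = malloc(64 + (i % 8) * 16);   // small, varied sizes
        static_cast<volatile char*>(p)[0] = 1; // touch it so it isn't elided
        free(p);
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start)
                .count();
  std::printf("%u threads x %d allocs: %lld ms\n", numThreads,
              allocsPerThread, static_cast<long long>(ms));
  return 0;
}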

@kripken
Member

kripken commented Mar 17, 2025

Thanks @danleh !

I believe we build on Alpine in order to get a single static build that can run on as many Linux distros as possible. Getting a dynamic build to run that way is harder IIUC, but I am not an expert on this.

@dschuff @sbc100 Do you know what we should be doing here? And how do we build the emsdk Binaryen - is that also linked statically with malloc/free? (If so perhaps it is slow as well, or if not, perhaps we can do the same here as we do there?)

danleh added a commit to danleh/binaryen that referenced this issue Mar 17, 2025
Which fixes performance problems, especially on heavily multi-threaded workloads (many functions, high core count machines), see WebAssembly#5561.
@danleh
Contributor

danleh commented Mar 17, 2025

I believe we build on Alpine in order to get a single static build that can run on as many Linux distros as possible. Getting a dynamic build to run that way is harder IIUC, but I am not an expert on this.

Right, makes sense.

If adding mimalloc to third_party/ is an option, a static build using mimalloc works for me locally, see #7378.

Build instructions:

git checkout mimalloc-static
git clean -d -f -x
cmake . -DCMAKE_CXX_FLAGS="-static" -DCMAKE_C_FLAGS="-static" -DCMAKE_BUILD_TYPE=Release -DBUILD_STATIC_LIB=ON
make -j64 wasm-opt
MIMALLOC_VERBOSE=1 bin/wasm-opt --version

gives (i.e., uses mimalloc)

mimalloc: process init: 0x3E198C40
...

and is statically linked

$ ldd bin/wasm-opt 
        not a dynamic executable

@dschuff
Member

dschuff commented Mar 17, 2025

The solution is probably to switch to dynamic linking

Is the goal here to allow users to interpose on malloc and free via LD_PRELOAD?

If the goal is just to link mimalloc from a dynamic library (some prebuilt distribution?) instead of linking it statically from Alpine's musl, I think that should be possible. It would of course also be possible to link it statically (via a prebuilt archive file, or by building it ourselves).

I was wondering if it might somehow be possible to link malloc and free via the dynamic symbol table (as if they came from a dylib) but have them linked into the program rather than a dylib; this would allow interposition via LD_PRELOAD but would allow us to keep shipping a maximally-compatible Alpine binary. I think it's possible in theory but unfortunately not with the existing GNU ld.

The emsdk binaryen is dynamically linked against glibc, so interposition is possible. We use Chromium's infrastructure to build that, and Chromium's solution for good compatibility is to build in a sysroot which has a pretty old version of glibc, and while not as maximally-compatible as statically linking libc, it works in practice on most systems. I don't recall hearing any requests to work on older systems.

Having said all that, I think the following are what I have historically thought we should focus on:

  1. Our main focus for binary performance should be on Emscripten users, since they are the vast majority of users of binary packages AFAIK.
  2. Second focus should be on those incorporating Binaryen into their toolchains, but IIUC they are mostly not using our Alpine packages (nor even our releases? My understanding had been that they mostly build themselves from tip of tree). So for these users we should maybe focus on making it easy to build Binaryen in a performant way. (Also, am I wrong about this? Maybe my view is skewed based on the ones who talk to us and help us develop new features.)
  3. Let's not forget about the wasm build of Binaryen. My understanding is that it's made its way into some node/npm-based workflows, but I don't have a good sense of how much traction it has there.

@danleh's experience suggests that I could be wrong about 2 (obviously he's using our prebuilt, but not Alpine). It seems OP is in the "DIY build" category.

Having said all that, including mimalloc in third_party as a submodule (or just a snapshot), and adding a CMake option seems like it would be pretty easy, and maybe we should just do that, if there's a good performance win compared to glibc and/or the default mac/windows allocators.

@tlively
Member

tlively commented Mar 17, 2025

Kotlin uses our release artifacts and Dart might as well (they certainly use our tagged releases). I think anything we can do to speed up our release artifacts is well worth doing. (@danleh already filed #7378 to add mimalloc to third_party and that sgtm.)

@dschuff
Member

dschuff commented Mar 17, 2025

Ah great, this sounds good to me too. Maybe we can do performance tests (or solicit them from users) on Mac and Windows too, and figure out where exactly we should enable it by default.

@kripken
Member

kripken commented Mar 17, 2025

@dschuff

Let's not forget about the wasm build of Binaryen.

Funnily enough, we already do that:

binaryen/CMakeLists.txt, lines 321 to 323 in 9b161a9:

# Use mimalloc to avoid a 5x slowdown:
# https://github.com/emscripten-core/emscripten/issues/15727#issuecomment-1960295018
add_link_flag("-sMALLOC=mimalloc")

We did this after

https://web.dev/articles/scaling-multithreaded-webassembly-applications#mimalloc

So maybe the wasm build has been faster than the native one... I didn't imagine a native build could have similar issues. Good find @danleh !

@danleh
Contributor

danleh commented Mar 18, 2025

Let's put concrete comments/suggestions on how to integrate and link mimalloc in the PR #7378.

Regarding more general discussion and questions:

Second focus should be on those incorporating Binaryen into their toolchains, but IIUC they are mostly not using our Alpine packages (nor even our releases? My understanding had been that they mostly build themselves from tip of tree)

As @tlively said, I don't think that's true (anymore). At least Kotlin (via Gradle) uses your tagged releases, and I am pretty sure average users often go the "path of least resistance" and just take your binaries (instead of building from source). (CC @eymar from Kotlin/JetBrains.)

The emsdk binaryen is dynamically linked against glibc, so interposition is possible. We use Chromium's infrastructure to build that, and Chromium's solution for good compatibility is to build in a sysroot which has a pretty old version of glibc, and while not as maximally-compatible as statically linking libc, it works in practice on most systems.

That's pretty cool, and at least in principle I would prefer dynamic linking of an old glibc over static linking of musl + mimalloc, for three reasons: (i) code size (even before mimalloc, the static wasm-opt x64 Linux build is >18MB), (ii) "familiarity" / fewer surprises (such as these performance problems), and (iii) more seamless integration with existing tools (e.g., one cannot run heaptrack or tcmalloc's heap profiling against the statically linked binaryen release binaries).

On the other hand, pulling in Chromium's infrastructure only for more portable binaries seems like total overkill, and you probably had good reasons to go the Alpine Linux / static linking route in the past. So I am fine with just statically linking mimalloc for fixing this particular performance issue.

if there's a good performance win compared to glibc and/or the default mac/windows allocators.
Maybe we can do performance tests (or solicit them from users) on Mac and Windows too, and figure out where exactly we should enable it by default.

I don't have numbers on Windows or macOS, but I think their allocator performance is much better than musl libc's in multi-threaded workloads, so the relative benefit of mimalloc over them is much smaller there. In other words: I don't think there is urgency for Windows or Mac, and I enabled mimalloc only for the static Linux build for now.

@danleh
Contributor

danleh commented Mar 18, 2025

And just to clarify explicitly, I had originally mixed up these points:

  1. There have been performance improvements in wasm-opt from version 119 to 122.
  2. glibc's allocator (used in a non-static build, e.g., when building myself from source) is much better than musl libc's allocator (used in the tagged/released Linux binaries).
  3. mimalloc is a bit better than glibc's allocator in terms of runtime, but has a higher memory footprint / max RSS.

Concrete numbers from my 128-core x64 machine, optimizing a 12 MB WasmGC binary (see #5561 (comment)):

| # | config | libc | linkage | allocator | wall time | sys time | max RSS (KB) |
|---|--------|------|---------|-----------|-----------|----------|--------------|
| 1 | official Linux binary, version 119 | musl | static | musl default | 8min 3s | 52004s | 812644 |
| 2 | official Linux binary, version 122 | musl | static | musl default | 6min 51s | 43328s | 739972 |
| 3 | release build from source at version_119 tag | glibc | dynamic | glibc default | 0min 56s | 1145s | 809272 |
| 4 | release build from source at version_122 tag | glibc | dynamic | glibc default | 0min 45s | 9s | 731068 |
| 5 | release build from source at main branch | glibc | dynamic | glibc default | 0min 44s | 9s | 724360 |
| 6 | release build from source at main + LD_PRELOAD of mimalloc | glibc | dynamic | mimalloc | 0min 29s | 9s | 1112704 |

(Building C++ with a statically linked musl libc is a pain on Ubuntu, hence I only have numbers from the prebuilt binaries and dynamically linking glibc.)

@MaxGraey
Contributor

1.5x faster runtime but also 1.5x more RSS. The trade-memory-for-speed (and vice versa) rule in action.

@kripken
Member

kripken commented Mar 18, 2025

@MaxGraey Compare rows 2 and 4 (122 with musl's allocator, 122 with some other allocator). The difference is almost 10x - I guess musl's allocator is just not that good at heavily multithreaded allocations.

I do agree 50% more memory is a downside though. It seems worth it in the official Linux binaries for a 9x speedup, but for dynamic builds where the speedup is just 1.5x, maybe not.

@MaxGraey
Contributor

MaxGraey commented Mar 18, 2025

I wonder if it would be better to statically link musl + statically link mimalloc (which should override musl's allocator). In theory it should be a little faster than dynamic mimalloc & glibc and should definitely reduce RSS.

@dschuff
Member

dschuff commented Mar 18, 2025

That's pretty cool, and at least in principle I would prefer dynamic linking of an old glibc over static linking of musl + mimalloc, for three reasons: (i) code size (even before mimalloc, the static wasm-opt x64 Linux build is >18MB), (ii) "familiarity" / fewer surprises (such as these performance problems), and (iii) more seamless integration with existing tools (e.g., one cannot run heaptrack or tcmalloc's heap profiling against the statically linked binaryen release binaries).

i) Is 18MB a lot? For the emscripten release builds we statically link libc++ and libbinaryen; wasm-opt is 19MB, clang is 131MB and the whole package is 300MB compressed and 1.2G uncompressed. We don't get too many (?) complaints about that, so I'm not sure I'd worry about code size too much.
iii) Preserving this capability would require not only dynamically linking libc, but also mimalloc itself. I think that's doable (at least on ELF) but not necessarily the recommended way to do it.

I don't think we'd need to bring all of Chromium's infrastructure in here. We could probably get away with just pulling in the sysroot. Or maybe there's an older Linux image available on GitHub than the one we are using.

Maybe the easiest thing to do first is just replace the Alpine build with something like that, and dynamically link against glibc.

@danleh
Contributor

danleh commented Mar 19, 2025

I wonder if it would be better to statically link musl + statically link mimalloc (which should override musl's allocator). In theory it should be a little faster than dynamic mimalloc & glibc and should definitely reduce RSS.

Numbers with glibc + mimalloc statically linked:

| # | config | libc | linkage | allocator | wall time | sys time | max RSS (KB) |
|---|--------|------|---------|-----------|-----------|----------|--------------|
| 7 | release build of PR #7378 with -static + -DMIMALLOC_STATIC=ON | glibc | static (not recommended) | mimalloc | 0min 30s | 9s | 1177720 |

In other words, at least for this workload and configuration (glibc, mimalloc), static linking doesn't make a difference (compare with row 6 above). As I said, building C++ code with musl on Ubuntu is a bit tedious (you have to build libstdc++ yourself from source), so I don't have numbers for it.

i) Is 18MB a lot? For the emscripten release builds we statically link libc++ and libbinaryen; wasm-opt is 19MB, clang is 131MB and the whole package is 300MB compressed and 1.2G uncompressed. We don't get too many (?) complaints about that, so I'm not sure I'd worry about code size too much.

Thanks for these ballpark numbers. Makes sense, then binary size here is not really a concern.

kripken pushed a commit that referenced this issue Mar 19, 2025
The new CMake flag MIMALLOC_STATIC controls this.

mimalloc fixes perf problems, especially on heavily multi-threaded
workloads (many functions, high core count machines) on the musl
allocator, see #5561.
@kripken
Member

kripken commented Mar 19, 2025

Fixed by #7378, which uses mimalloc in Linux static builds (including our official ones); that is the only place we are aware of issues atm.

@kripken kripken closed this as completed Mar 19, 2025
@dschuff
Member

dschuff commented Mar 19, 2025

Just for completeness, I'll mention that I actually experimented with using the Chromium sysroot instead of Alpine for the release build, but I forgot that the sysroot doesn't have a new enough version of libstdc++ for Binaryen. Chrome (and our emsdk build) solve this by building their own version of libc++ to go along with the app, and we could certainly do that here too. It's just more of a pain than what we have here.

I guess the only thing missing is the ability to dynamically override malloc in the release build for profiling, but I guess Daniel already proved that it's easy enough to just build your own if you want to do that.
