Consider linking with mimalloc in release executables? #5561


Closed
TerrorJack opened this issue Mar 10, 2023 · 38 comments


@TerrorJack
Contributor

I've seen a huge performance improvement when wasm-opt is linked with mimalloc and optimizes a big wasm module on a many-core machine!

Result sans mimalloc:

$ time bench ./test.sh
benchmarking ./test.sh
time                 221.9 s    (218.3 s .. 224.6 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 219.3 s    (218.1 s .. 220.3 s)
std dev              1.411 s    (1.155 s .. 1.566 s)
variance introduced by outliers: 19% (moderately inflated)


real    58m35.860s
user    129m50.133s
sys     2395m41.639s

Result with mimalloc:

$ time bench ./test.sh 
benchmarking ./test.sh
time                 14.06 s    (13.38 s .. 14.86 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 13.94 s    (13.82 s .. 14.03 s)
std dev              123.8 ms   (76.09 ms .. 151.4 ms)
variance introduced by outliers: 19% (moderately inflated)


real    3m43.584s
user    45m38.783s
sys     0m40.349s
  • The sans-mimalloc case uses the official Linux x64 binaries of version_112, while the with-mimalloc case is compiled from the same version but linked with mimalloc v2.0.9
  • The command is wasm-opt -Oz hello.wasm -o hello.opt.wasm; hello.wasm is 26MB
  • The benchmark is run with bench, which runs the same command multiple times and outputs the statistics above.
  • The test is conducted in a ubuntu:22.10 container on a server with AMD EPYC 7401P (48 logical cores) and 128GB memory.
@kripken
Member

kripken commented Mar 10, 2023

Very interesting!

Overall this makes me think that maybe the issues we've seen with multithreading overhead are due to malloc contention between threads, like these: emscripten-core/emscripten#15727, #2740

It might be good to investigate two things here:

  • How easy mimalloc integration is (if it's a single file, and has a wasm port - which we need for binaryen.js - that would be ideal).
  • Whether we can reduce our malloc contention. We allocate Expression objects very efficiently in arenas, so this must be just other random small allocations that we do all over the place... if so then using more SmallSet/SmallVector might help (see the sketch after this list). Perhaps there is a tool that can find which stack traces lead to most of these contending allocations.
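
For illustration, here is a minimal sketch of the small-buffer idea behind SmallSet/SmallVector (Binaryen's real implementations differ in detail; the names and layout here are hypothetical):

#include <array>
#include <cstddef>
#include <vector>

// The small-buffer idea: keep the first N elements in inline storage so the
// common case never calls malloc, and spill to the heap only for the rare
// oversized case. This is the concept, not Binaryen's actual implementation.
template <typename T, size_t N>
struct TinySmallVector {
  std::array<T, N> inlineStorage; // no allocation for the common case
  size_t used = 0;
  std::vector<T> overflow;        // heap storage, touched only when full

  void push_back(const T& x) {
    if (used < N) {
      inlineStorage[used++] = x;  // fast path: no malloc, no allocator lock
    } else {
      overflow.push_back(x);      // slow path: ordinary heap allocation
    }
  }
  size_t size() const { return used + overflow.size(); }
};

Since hot paths then never touch malloc, they also never contend on an allocator lock, which is exactly the overhead suspected here.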

@tlively
Member

tlively commented Mar 10, 2023

Looks like Google publishes a heap profiler that might be useful for this: https://gperftools.github.io/gperftools/heapprofile.html

@TerrorJack
Contributor Author

Linking with mimalloc to replace the libc built-in allocator only requires special link-time configuration, and doesn't require changing C/C++ source code at all. When targeting wasm, you don't need to do anything special; just live with the original libc allocator.

https://github.com/rui314/mold/blob/main/CMakeLists.txt#L138 is a good example of properly linking against mimalloc. Though it's even possible to not modify the CMake config at all and just specify -DCMAKE_EXE_LINKER_FLAGS="-Wl,--push-state,--whole-archive,path/to/libmimalloc.a,--pop-state" for Linux or -DCMAKE_EXE_LINKER_FLAGS="-Wl,-force_load,path/to/libmimalloc.a" for macOS at configure time.

@MaxGraey
Contributor

When targeting wasm, you don't need to do anything special; just live with the original libc allocator.

If I am not mistaken, mimalloc supports WebAssembly, but only with WASI

@TerrorJack
Contributor Author

If I am not mistaken, mimalloc supports WebAssembly, but only with WASI

You don't need mimalloc at all when targeting wasm.

@kripken
Member

kripken commented Mar 10, 2023

You don't need mimalloc at all when targeting wasm

I think it could be useful in multithreaded wasm builds? Just like for native ones.

@TerrorJack
Contributor Author

I think it could be useful in multithreaded wasm builds? Just like for native ones.

That's correct, although the mimalloc codebase currently only supports single-threaded wasm32-wasi.

@kripken
Member

kripken commented Mar 10, 2023

I see, makes sense.

I'd be ok with just enabling mimalloc for non-wasm for now then, if we want to go that route.

@TerrorJack
Contributor Author

If anyone wants to give the mimalloc flavour a try, I've created a statically-linked x86_64-linux binary release of version_112 at https://nightly.link/type-dance/binaryen/actions/artifacts/593257094.zip. The build script is available at https://github.com/type-dance/binaryen/blob/main/build-alpine.sh.

@kripken
Member

kripken commented Mar 15, 2023

Note: This issue is relevant for #4165

@tlively I looked into tcmalloc to profile our mallocs. I found some possible improvements and will open PRs, but I'm not sure how big an impact they will have. An issue is that tcmalloc measures the size of allocations, not the number of malloc calls, and we might have very many small allocations or quickly-freed ones that don't show up in that type of profiling.
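
Since tcmalloc's heap profile reports bytes, one cheap way to see call counts instead (a hypothetical shim, not something Binaryen ships) is to override the global operator new/delete in a debug build and tally invocations:

#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <new>

// Counting shim: overriding the global operator new/delete counts allocation
// *calls*, complementing profilers that report bytes rather than call counts.
static std::atomic<unsigned long long> newCalls{0};

void* operator new(std::size_t size) {
  newCalls.fetch_add(1, std::memory_order_relaxed);
  if (void* p = std::malloc(size)) {
    return p;
  }
  throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Call this at exit (e.g. via atexit) to see how many allocations were made.
void reportAllocCalls() {
  std::printf("operator new called %llu times\n", newCalls.load());
}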

@tlively
Member

tlively commented Mar 15, 2023

I just found mutrace for profiling lock contention specifically. I'd be very interested to see the results here!

http://0pointer.de/blog/projects/mutrace.html

@kripken
Member

kripken commented Mar 15, 2023

Interesting!

It says this:

Due to the way mutrace works we cannot profile mutexes that are used internally in glibc, such as those used for synchronizing stdio and suchlike.

So I tried both with the normal system malloc (which it seems it may not be able to profile) and with a userspace malloc (tcmalloc). The results did not change much. I guess that is consistent with malloc contention not actually being an issue on some machines, perhaps because their mallocs have low contention (like the default Linux one on my machine, and tcmalloc).

So to really dig into malloc performance we'd need to run mutrace on a system that sees the slowdown. @TerrorJack perhaps you can try that?

But the results on my machine are interesting about non-malloc mutexes. There is one top mutex by far:

mutrace: Showing 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
      10 41885864 16900980 10783056     7334.222        0.000       36.992 M-.--.
      16     2470     2286      713      114.601        0.046       24.002 M-.--.
       4      734      366        0    68689.275       93.582    19386.295 M-.--.

That mutex #10 is

Mutex #10 (0x0x7f54728ffca0) first referenced by:
	libmutrace.so(pthread_mutex_lock+0x46) [0x7f54729c1576]
	libbinaryen.so(+0x9b61d7) [0x7f54723b61d7]
	libbinaryen.so(_ZN4wasm4TypeC1ENS_8HeapTypeENS_11NullabilityE+0x41) [0x7f54723b6ae1]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder12getBasicTypeEiRNS_4TypeE+0x113) [0x7f547233aed3]
	libbinaryen.so(+0x93d529) [0x7f547233d529]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder9readTypesEv+0x4dd) [0x7f5472340f9d]
	libbinaryen.so(_ZN4wasm17WasmBinaryBuilder4readEv+0x748) [0x7f547235fe48]
	libbinaryen.so(_ZN4wasm12ModuleReader14readBinaryDataERSt6vectorIcSaIcEERNS_6ModuleENSt7__cxx1112basic_stringIcSt11char_traitsIcES2_EE+0x5c) [0x7f54723732cc]
	libbinaryen.so(_ZN4wasm12ModuleReader10readBinaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_6ModuleES6_+0x73) [0x7f5472373503]
	libbinaryen.so(_ZN4wasm12ModuleReader4readENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_6ModuleES6_+0x17a) [0x7f5472373e1a]
	wasm-opt(+0x29a06) [0x5644f215fa06]
	libc.so.6(+0x2718a) [0x7f547144618a]

That is the Type mutex used in wasm::Type::Type(wasm::HeapType, wasm::Nullability). Perhaps that can be improved, @tlively?

(the one after it is the thread pool mutex, wasm::ThreadPool::initialize(unsigned long), which I doubt we can improve, but also it's orders of magnitude less frequent)

@kripken
Member

kripken commented Mar 15, 2023

(that is a profile on running wasm-opt -g -all --closed-world -tnh -O3 --type-ssa --gufa -O3 --type-merging on a large Dart testcase of Wasm GC, so it does stress type optimizations I guess)

@tlively
Member

tlively commented Mar 15, 2023

Makes sense, this confirms a suspicion I had that the global type cache is extremely contended. We frequently do things like type == Type(heapType, Nullable) or Type::isSubType(type, Type(heapType, Nullable)), and the creation of those temporary Type objects requires taking the lock. I'll take an action item to try to purge these patterns from the code base.
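
As a minimal sketch of the fix (with a toy type standing in for the real wasm::Type, whose interning machinery differs), hoisting the temporary out of the loop takes the cache lock once rather than once per element:

#include <mutex>
#include <vector>

// Toy model of the pattern described above, not Binaryen's real Type:
// constructing a Type consults a global cache under a mutex, so building
// temporaries like Type(heapType, Nullable) inside a hot loop serializes
// all threads on that one lock.
namespace toy {
std::mutex typeCacheMutex;
struct HeapType { int id; };
enum Nullability { NonNullable, Nullable };
struct Type {
  int id;
  Type(HeapType ht, Nullability n) {
    std::lock_guard<std::mutex> lock(typeCacheMutex); // the global cache lock
    id = ht.id * 2 + (n == Nullable ? 1 : 0);         // stand-in for interning
  }
  bool operator==(const Type& other) const { return id == other.id; }
};
} // namespace toy

int countNullable(const std::vector<toy::Type>& types, toy::HeapType ht) {
  toy::Type nullable(ht, toy::Nullable); // lock taken once, outside the loop
  int count = 0;
  for (const auto& t : types) {
    if (t == nullable) { // plain comparison, no lock
      ++count;
    }
  }
  return count;
}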

@kripken
Member

kripken commented Mar 15, 2023

As another datapoint, I also ran with plain dlmalloc, which doesn't have any complex per-thread pools AFAIK. But the results are the same as with my system allocator and tcmalloc. So somehow I just don't see malloc contention on my machine...

@arsnyder16
Contributor

arsnyder16 commented Mar 16, 2023

Just another data point to consider. I did some crude timing tests of wasm-opt using LD_PRELOAD, measuring the mimalloc vs jemalloc vs glibc allocators.

The pre-optimized wasm file is ~117MB.

mimalloc

real    1m28.478s
user    13m48.665s
sys     0m1.572s

jemalloc

real    1m0.543s
user    7m59.951s
sys     0m0.931s

glibc

real    1m25.956s
user    9m40.555s
sys     0m1.791s

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 
echo "mimalloc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf-mimalloc.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 
echo "jemalloc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf-jemalloc.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext
unset LD_PRELOAD
echo "glibc"
time /root/emsdk/upstream/bin/wasm-opt --strip-dwarf --post-emscripten -Os  --low-memory-unused --zero-filled-memory --pass-arg=directize-initial-contents-immutable --strip-debug --strip-producers  \
    perf.wasm -o perf.wasm --mvp-features --enable-threads --enable-bulk-memory --enable-mutable-globals --enable-sign-ext

Here is the output from running with BINARYEN_PASS_DEBUG=1:
output.txt

@TerrorJack
Contributor Author

@arsnyder16 Thanks for conducting the experiment. Have you actually confirmed mimalloc is used at runtime by setting MIMALLOC_VERBOSE=1? Dynamic override via LD_PRELOAD doesn't seem to work at all for some reason:

$ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./wasm-opt --version
wasm-opt version 112 (version_112)

@arsnyder16
Contributor

Seems to be working fine for me:

# MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 /root/emsdk/upstream/bin/wasm-opt --version
mimalloc: option 'show_errors': 1
mimalloc: option 'show_stats': 0
mimalloc: option 'eager_commit': 1
mimalloc: option 'deprecated_eager_region_commit': 0
mimalloc: option 'deprecated_reset_decommits': 0
mimalloc: option 'large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 0
mimalloc: option 'segment_cache': 0
mimalloc: option 'page_reset': 0
mimalloc: option 'abandoned_page_decommit': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'decommit_delay': 25
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 0
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'allow_decommit': 1
mimalloc: option 'segment_decommit_delay': 500
mimalloc: option 'decommit_extend_delay': 2
mimalloc: process init: 0x7f1a07b41f40
mimalloc: debug level : 2
mimalloc: secure level: 0
mimalloc: using 1 numa regions
wasm-opt version 112 (version_112-45-g9dcdd47a2)
heap stats:    peak      total      freed    current       unit      count
normal   1:    552 B      552 B      552 B        0          8 B       69      ok
normal   4:    4.7 KiB    4.8 KiB    4.8 KiB     32 B       32 B      155      not all freed!
normal   6:   35.6 KiB   48.2 KiB   37.5 KiB   10.7 KiB     48 B      1.0 K    not all freed!
normal   8:    9.4 KiB   22.6 KiB   16.8 KiB    5.7 KiB     64 B      361      not all freed!
normal   9:    7.2 KiB   14.2 KiB   10.4 KiB    3.8 KiB     80 B      182      not all freed!
normal  10:    2.7 KiB    4.7 KiB    3.2 KiB    1.5 KiB     96 B       50      not all freed!
normal  11:    1.8 KiB    3.9 KiB    2.7 KiB    1.2 KiB    112 B       36      not all freed!
normal  12:   18.6 KiB   19.5 KiB   18.5 KiB    1.0 KiB    128 B      156      not all freed!
normal  13:    1.5 KiB    3.1 KiB    2.1 KiB    960 B      160 B       20      not all freed!
normal  14:    768 B      1.3 KiB    960 B      384 B      192 B        7      not all freed!
normal  15:    672 B      1.7 KiB    1.3 KiB    448 B      224 B        8      not all freed!
normal  16:    512 B      512 B      256 B      256 B      256 B        2      not all freed!
normal  17:    320 B      320 B      320 B        0        320 B        1      ok
normal  18:    768 B      768 B      768 B        0        384 B        2      ok
normal  19:    448 B      896 B      896 B        0        448 B        2      ok
normal  21:    640 B      640 B      640 B        0        640 B        1      ok
normal  23:    1.7 KiB    3.5 KiB    3.5 KiB      0        896 B        4      ok
normal  25:    1.2 KiB    2.5 KiB    1.2 KiB    1.2 KiB    1.2 KiB      2      not all freed!
normal  27:    3.5 KiB    7.0 KiB    7.0 KiB      0        1.7 KiB      4      ok
normal  29:    2.5 KiB    2.5 KiB    2.5 KiB      0        2.5 KiB      1      ok
normal  31:   10.5 KiB   14.0 KiB   14.0 KiB      0        3.5 KiB      4      ok
normal  33:    5.0 KiB    5.0 KiB    5.0 KiB      0        5.0 KiB      1      ok
normal  35:   14.0 KiB   14.0 KiB   14.0 KiB      0        7.0 KiB      2      ok
normal  37:   10.0 KiB   10.0 KiB   10.0 KiB      0       10.0 KiB      1      ok
normal  41:   20.0 KiB   20.0 KiB   20.0 KiB      0       20.0 KiB      1      ok
normal  45:   40.1 KiB   40.1 KiB      0       40.1 KiB   40.1 KiB      1      not all freed!

heap stats:    peak      total      freed    current       unit      count
    normal:  142.7 Ki   231.1 Ki   166.9 Ki    64.2 Ki     112 B      2.1 K    not all freed!
     large:      0          0          0          0                            ok
      huge:      0          0          0          0                            ok
     total:  142.7 KiB  231.1 KiB  166.9 KiB   64.2 KiB                        not all freed!
malloc req:  128.6 KiB  206.6 KiB  147.9 KiB   58.6 KiB                        not all freed!

  reserved:   64.0 MiB   64.0 MiB      0       64.0 MiB                        not all freed!
 committed:   64.0 MiB   64.0 MiB      0       64.0 MiB                        not all freed!
     reset:      0          0          0          0                            ok
   touched:  357.5 KiB  379.8 KiB   99.5 KiB  280.3 KiB                        not all freed!
  segments:      1          1          0          1                            not all freed!
-abandoned:      0          0          0          0                            ok
   -cached:      0          0          0          0                            ok
     pages:     23         29         16         13                            not all freed!
-abandoned:      0          0          0          0                            ok
 -extended:     48
 -noretire:     22
     mmaps:      1
   commits:      0
   threads:      0          0          0          0                            ok
  searches:     0.3 avg
numa nodes:       1
   elapsed:       0.002 s
   process: user: 0.002 s, system: 0.000 s, faults: 0, rss: 9.0 MiB, commit: 64.0 MiB
mimalloc: process done: 0x7f1a07b41f40

kripken added a commit that referenced this issue Mar 21, 2023
This makes the pass 2-3% faster in some measurements I did locally.

Noticed when profiling for #5561 (comment)

Helps #4165
kripken added a commit that referenced this issue Apr 5, 2023
This saves the work of freeing and allocating for all the other maps. This is a
code path that is used by several passes so it showed up in profiling for
#5561
@danleh
Contributor

danleh commented Mar 11, 2025

I stumbled over this issue when compiling/optimizing a Kotlin/Wasm application, specifically the benchmarks in https://github.com/JetBrains/compose-multiplatform/tree/ok/benchmarks_d8/benchmarks/multiplatform via ./gradlew clean :benchmarks:wasmJsProductionExecutableCompileSync. Changing from my Linux system allocator to mimalloc improves wasm-opt runtime by >30%. More details, repro instructions, and allocation profiles below. CC @eymar

To easily reproduce this, simply download the attached input WasmGC binary and measure via /usr/bin/time -v wasm-opt --enable-gc --enable-reference-types --enable-exception-handling --enable-bulk-memory --enable-nontrapping-float-to-int --closed-world --inline-functions-with-loops --traps-never-happen --fast-math --type-ssa -O3 -O3 --gufa -O3 --type-merging -O3 -Oz input/compose-benchmarks-benchmarks-wasm-js.wasm -o output.wasm (the flags are what Kotlin/Wasm uses).

wasm-opt version_119 takes >8 minutes (output of /usr/bin/time -v wasm-opt ...) on a high-core-count x64 machine:

User time (seconds): 961.47
System time (seconds): 52019.90
Percent of CPU this job got: 10831%
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:09.15
[...]
Maximum resident set size (kbytes): 841232

See the very high system time. Looking at it with perf top -g shows this is because of lock contention in the allocator.

wasm-opt version_122-69-g0472ba2cf (I compiled locally from the current main branch) is already a vast improvement:

User time (seconds): 238.92
System time (seconds): 9.31
Percent of CPU this job got: 528%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.98
[...]
Maximum resident set size (kbytes): 753204

I assume the changes linked above reduced the number of allocations drastically, hence removing most of the allocator lock contention.

However, this can be sped up by another 30% (!) by using mimalloc as an allocator. Specifically:

sudo apt install libmimalloc2.0
# Make sure it's taken up by wasm-opt via LD_PRELOAD:
MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 wasm-opt --version
# Should print something like:
mimalloc: process init: 0x7F5DA235E400
[...]
# Then run with:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 /usr/bin/time -v wasm-opt ... (rest of the arguments from above)

This gives an even better runtime (at the cost of higher max RSS):

User time (seconds): 213.09
System time (seconds): 6.26
Percent of CPU this job got: 714%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:30.68
[...]
Maximum resident set size (kbytes): 1157180

@danleh
Contributor

danleh commented Mar 11, 2025

Forgot the binary for the repro, attached: compose-benchmarks-benchmarks-wasm-js.zip

@danleh
Contributor

danleh commented Mar 11, 2025

I also tried to briefly check if there are more allocations one could get rid of, both by regular profiling and taking a heap profile from a RelWithDebInfo build:

cd path/to/binaryen/
cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo && make
cd path/to/repro
perf record -g -k1 --freq=1000 path/to/wasm-opt ... && pprof -flame perf.data

gives this profile (sorry, Googlers-only): https://pprof.corp.google.com/?id=4b7bcac9402431aa89920010f8724f15

I am not a binaryen expert, but wasm::GlobalTypeRewriter::rebuildTypes and wasm::Walker::walk come up quite a bit as parents of malloc and free (see the Bottom-Up profile). Is that expected?

I also quickly took a heap profile with tcmalloc from gperftools:

LD_PRELOAD=~/gperftools/.libs/libtcmalloc_and_profiler.so MALLOCSTATS=1 HEAP_PROFILE_ALLOCATION_INTERVAL=100000000 HEAPPROFILE=tcmalloc path/to/wasm-opt ...
# This produces several heapdumps, after each 100MB allocated. You can then inspect one of the later ones with:
pprof -flame tcmalloc.0226.heap

which gives this heap profile: https://pprof.corp.google.com/?id=862bc2615a922858148915a281818d11&metric=alloc_objects&tab=bottomup

Again, I am not a binaryen expert, but there are a couple of heavily allocating or reallocating functions:

  • wasm::Type::getHeapTypeChildren is responsible for >7% of all allocations via std::__new_allocator::allocate (expand the latter in the bottom-up profile recursively, until you find getHeapTypeChildren)
  • wasm::Type::getHeapTypeChildren is responsible for >8% of all allocations via _M_realloc_append (same story: expand the latter in the bottom-up profile until you see it).
  • In total getHeapTypeChildren seems to be responsible for >25% of all allocations, see https://pprof.corp.google.com/?id=862bc2615a922858148915a281818d11&metric=alloc_objects&filter=focus:getHeapTypeChildren&tab=flame
  • wasm::PassUtils::FilteredPass::create is responsible for >6% of all allocations via std::make_unique.
  • wasm::PostWalker::scan or wasm::Walker::pushTask (not sure, one is partially inlined) seem to contain a wasm::SmallVector that becomes too large and then is placed on the heap (>4% of all allocations). Maybe one can just increase the inline size a bit to avoid more heap allocations?
  • 4% of all allocations are from hash table inserts in wasm::InsertOrderedMap::insert. Maybe one could replace the std::unordered_map here with a C++23 std::flat_map or absl::flat_hash_map and pre-size it if possible?
  • 4% of all allocations are from std::set::insert in wasm::EffectAnalyzer::InternalAnalyzer::visitLocalGet (I think here). Maybe use a flat_set here as well (C++23 or Abseil)? (A sketch of both ideas follows this list.)
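
A rough sketch of those last two ideas using only standard containers (the C++23/Abseil flat containers mentioned above would look similar); the names are illustrative, not Binaryen's:

#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// 1. Pre-sizing: reserving buckets up front replaces the repeated
//    rehash-and-reallocate churn that shows up as many small allocations
//    in the heap profile with a single allocation.
std::unordered_map<int, int> makePresizedMap(std::size_t expected) {
  std::unordered_map<int, int> map;
  map.reserve(expected);
  return map;
}

// 2. Flat set: a sorted vector allocates once per capacity growth instead
//    of once per inserted node (as std::set does), a good trade for small,
//    read-heavy sets.
struct FlatIntSet {
  std::vector<int> items; // kept sorted and unique

  void insert(int x) {
    auto it = std::lower_bound(items.begin(), items.end(), x);
    if (it == items.end() || *it != x) {
      items.insert(it, x);
    }
  }
  bool contains(int x) const {
    return std::binary_search(items.begin(), items.end(), x);
  }
};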

@kripken
Member

kripken commented Mar 11, 2025

The big speedup since 119 is likely the large amount of work we did a few months ago, mentioned in #4165

::walk etc. is expected - every pass does a walk over the IR, and that is the core method. There are at least no obvious optimizations left there... rebuildTypes is more surprising, but our core Type and HeapType classes are not arena-allocated (and can't easily be), so a module with lots of types and lots of type optimization opportunities can end up working a lot there, I guess. Indeed, looking at passes at a high level with BINARYEN_PASS_DEBUG=1, the slow ones include

signature-pruning
signature-refining
gto
type-refining
abstract-type-refining

All those rebuild the type graph when they find things to optimize.

The allocations are interesting, thanks for that analysis. Those are definitely worth looking into.

@danleh
Contributor

danleh commented Mar 17, 2025

@arsnyder16 Thanks for conducting the experiment. Have you actually confirmed mimalloc is used at runtime by setting MIMALLOC_VERBOSE=1? Dynamic override via LD_PRELOAD doesn't seem to work at all for some reason:

$ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./wasm-opt --version
wasm-opt version 112 (version_112)

Ha, I've found out why using mimalloc via LD_PRELOAD didn't work for you! The official Linux binaries of Binaryen (including wasm-opt) are built via the Alpine-based release workflow, and that produces a static binary. You can confirm that via:

$ ldd binaryen-version_122/bin/wasm-opt
        not a dynamic executable

This uses musl libc (since glibc cannot be statically linked), and musl's allocator suffers from heavy lock contention in multi-threaded workloads; see e.g. https://nickb.dev/blog/default-musl-allocator-considered-harmful-to-performance/

In other words, all users that don't build binaryen from source but use the Linux binaries from https://github.com/WebAssembly/binaryen/releases will suffer from this allocator, especially on machines with many cores.

The solution is probably to switch to dynamic linking, which will then use glibc; or, if a static binary is required, to use a better allocator (mimalloc supports static linking, see https://github.com/microsoft/mimalloc?tab=readme-ov-file#using-the-library)
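
As an aside, the contention is easy to reproduce in isolation with a micro-benchmark along these lines (hypothetical, not from this thread): many threads making small, short-lived allocations, which is roughly the pattern of Binaryen's parallel passes. Build with g++ -O2 -pthread and run under each allocator via LD_PRELOAD to compare wall times:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Spawn one worker per hardware thread, each doing many small, short-lived
// allocations, and report the total wall time. Under a contended allocator
// (e.g. musl's) this slows down sharply as the thread count grows.
int main() {
  const unsigned numThreads = std::thread::hardware_concurrency();
  const int allocsPerThread = 1'000'000;

  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < numThreads; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < allocsPerThread; ++i) {
        void* p = malloc(64 + (i % 8) * 16);   // small, varied sizes
        static_cast<volatile char*>(p)[0] = 1; // touch it so it isn't elided
        free(p);
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start)
                .count();
  std::printf("%u threads x %d allocs: %lld ms\n", numThreads,
              allocsPerThread, static_cast<long long>(ms));
  return 0;
}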

@kripken
Member

kripken commented Mar 17, 2025

Thanks @danleh !

I believe we build on Alpine in order to get a single static build that can run on as many Linux distros as possible. Getting a dynamic build to run that way is harder IIUC, but I am not an expert on this.

@dschuff @sbc100 Do you know what we should be doing here? And how do we build the emsdk Binaryen - is that also linked statically with malloc/free? (If so perhaps it is slow as well, or if not, perhaps we can do the same here as we do there?)

danleh added a commit to danleh/binaryen that referenced this issue Mar 17, 2025
Which fixes performance problems, especially on heavily multi-threaded workloads (many functions, high core count machines), see WebAssembly#5561.
@danleh
Contributor

danleh commented Mar 17, 2025

I believe we build on Alpine in order to get a single static build that can run on as many Linux distros as possible. Getting a dynamic build to run that way is harder IIUC, but I am not an expert on this.

Right, makes sense.

If adding mimalloc to third_party/ is an option, a static build using mimalloc works for me locally, see #7378.

Build instructions:

git checkout mimalloc-static
git clean -d -f -x
cmake . -DCMAKE_CXX_FLAGS="-static" -DCMAKE_C_FLAGS="-static" -DCMAKE_BUILD_TYPE=Release -DBUILD_STATIC_LIB=ON
make -j64 wasm-opt
MIMALLOC_VERBOSE=1 bin/wasm-opt --version

gives (i.e., uses mimalloc)

mimalloc: process init: 0x3E198C40
...

and is statically linked

$ ldd bin/wasm-opt 
        not a dynamic executable

@dschuff
Member

dschuff commented Mar 17, 2025

The solution is probably to switch to dynamic linking

Is the goal here to allow users to interpose on malloc and free via LD_PRELOAD?

If the goal is just to link mimalloc from a dynamic library (some prebuilt distribution?) instead of linking it statically from Alpine's musl, I think that should be possible. It would of course also be possible to link it statically (via a prebuilt archive file, or by building it ourselves).

I was wondering if it might somehow be possible to link malloc and free via the dynamic symbol table (as if they came from a dylib) but have them linked into the program rather than a dylib; this would allow interposition via LD_PRELOAD but would allow us to keep shipping a maximally-compatible Alpine binary. I think it's possible in theory but unfortunately not with the existing GNU ld.

The emsdk binaryen is dynamically linked against glibc, so interposition is possible. We use Chromium's infrastructure to build that, and Chromium's solution for good compatibility is to build in a sysroot which has a pretty old version of glibc, and while not as maximally-compatible as statically linking libc, it works in practice on most systems. I don't recall hearing any requests to work on older systems.

Having said all that, I think the following are what I have historically thought we should focus on:

  1. Our main focus for binary performance should be on Emscripten users, since they are the vast majority of users of binary packages AFAIK.
  2. Second focus should be on those incorporating Binaryen into their toolchains, but IIUC they are mostly not using our Alpine packages (nor even our releases? My understanding had been that they mostly build themselves from tip of tree). So for these users we should maybe focus on making it easy to build Binaryen in a performant way. (Also, am I wrong about this? Maybe my view is skewed based on the ones who talk to us and help us develop new features.)
  3. Let's not forget about the wasm build of Binaryen. My understanding is that it's made its way into some node/npm-based workflows, but I don't have a good sense of how much traction it has there.

@danleh's experience suggests that I could be wrong about 2 (obviously he's using our prebuilt, but not Alpine). It seems OP is in the "DIY build" category.

Having said all that, including mimalloc in third_party as a submodule (or just a snapshot), and adding a CMake option seems like it would be pretty easy, and maybe we should just do that, if there's a good performance win compared to glibc and/or the default mac/windows allocators.

@tlively
Member

tlively commented Mar 17, 2025

Kotlin uses our release artifacts and Dart might as well (they certainly use our tagged releases). I think anything we can do to speed up our release artifacts is well worth doing. (@danleh already filed #7378 to add mimalloc to third_party and that sgtm.)

@dschuff
Member

dschuff commented Mar 17, 2025

Ah great, this sounds good to me too. Maybe we can do performance tests (or solicit them from users) on Mac and Windows too, and figure out where exactly we should enable it by default.

@kripken
Member

kripken commented Mar 17, 2025

@dschuff

Let's not forget about the wasm build of Binaryen.

Funnily enough, we already do that:

binaryen/CMakeLists.txt, lines 321 to 323 in 9b161a9:

# Use mimalloc to avoid a 5x slowdown:
# https://github.com/emscripten-core/emscripten/issues/15727#issuecomment-1960295018
add_link_flag("-sMALLOC=mimalloc")

We did this after

https://web.dev/articles/scaling-multithreaded-webassembly-applications#mimalloc

So maybe the wasm build has been faster than the native one... I didn't imagine a native build could have similar issues. Good find @danleh !

@danleh
Contributor

danleh commented Mar 18, 2025

Let's put concrete comments/suggestions on how to integrate and link mimalloc in the PR #7378.

Regarding more general discussion and questions:

Second focus should be on those incorporating Binaryen into their toolchains, but IIUC they are mostly not using our Alpine packages (nor even our releases? My understanding had been that they mostly build themselves from tip of tree)

As @tlively said, I don't think that's true (anymore). At least Kotlin (via Gradle) uses your tagged releases, and I am pretty sure average users often go the "path of least resistance" and just take your binaries (instead of building from source). (CC @eymar from Kotlin/JetBrains.)

The emsdk binaryen is dynamically linked against glibc, so interposition is possible. We use Chromium's infrastructure to build that, and Chromium's solution for good compatibility is to build in a sysroot which has a pretty old version of glibc, and while not as maximally-compatible as statically linking libc, it works in practice on most systems.

That's pretty cool, and at least in principle I would prefer dynamic linking of an old glibc over static linking of musl + mimalloc, for three reasons: (i) code size (even before mimalloc, the static wasm-opt x64 Linux build is >18MB), (ii) "familiarity" / fewer surprises (such as these performance problems), and (iii) more seamless integration with existing tools (e.g., one cannot run heaptrack or tcmalloc's heap profiling against the statically linked binaryen release binaries).

On the other hand, pulling in Chromium's infrastructure only for more portable binaries seems like total overkill, and you probably had good reasons to go the Alpine Linux / static linking route in the past. So I am fine with just statically linking mimalloc for fixing this particular performance issue.

if there's a good performance win compared to glibc and/or the default mac/windows allocators.
Maybe we can do performance tests (or solicit them from users) on Mac and Windows too, and figure out where exactly we should enable it by default.

I don't have numbers on Windows or macOS, but I think their allocator performance is much better than musl libc's in multi-threaded workloads, so the relative benefit of mimalloc over them is much smaller there. In other words: I don't think there is urgency for Windows or Mac, and I enabled mimalloc only for the static Linux build for now.

@danleh
Contributor

danleh commented Mar 18, 2025

And just to clarify explicitly, I had originally mixed up these points:

  1. There have been performance improvements in wasm-opt from version 119 to 122.
  2. glibc's allocator (used in a non-static build, e.g., when building myself from source) is much better than musl libc's allocator (used in the tagged/released Linux binaries).
  3. mimalloc is a bit better than glibc's allocator in terms of runtime, but has a higher memory footprint / max RSS.

Concrete numbers from my 128-core x64 machine, optimizing a 12 MB WasmGC binary (see #5561 (comment)):

| # | config | libc | linkage | allocator | wall time | sys time | max RSS (KB) |
|---|--------|------|---------|-----------|-----------|----------|--------------|
| 1 | official Linux binary, version 119 | musl | static | musl default | 8min 3s | 52004s | 812644 |
| 2 | official Linux binary, version 122 | musl | static | musl default | 6min 51s | 43328s | 739972 |
| 3 | release build from source at version_119 tag | glibc | dynamic | glibc default | 0min 56s | 1145s | 809272 |
| 4 | release build from source at version_122 tag | glibc | dynamic | glibc default | 0min 45s | 9s | 731068 |
| 5 | release build from source at main branch | glibc | dynamic | glibc default | 0min 44s | 9s | 724360 |
| 6 | release build from source at main + LD_PRELOAD of mimalloc | glibc | dynamic | mimalloc | 0min 29s | 9s | 1112704 |

(Building C++ with a statically linked musl libc is a pain on Ubuntu, hence I only have numbers from the prebuilt binaries and dynamically linking glibc.)

@MaxGraey
Contributor

1.5x faster runtime but also 1.5x more RSS. The trade-memory-for-speed (and vice versa) rule in action.

@kripken
Member

kripken commented Mar 18, 2025

@MaxGraey Compare rows 2 and 4 (122 with musl's allocator, 122 with some other allocator). The difference is almost 10x - I guess musl's allocator is just not that good at heavily multithreaded allocations.

I do agree 50% more memory is a downside though. It seems worth it in the official Linux binaries for a 9x speedup, but for dynamic builds where the speedup is just 1.5x, maybe not.

@MaxGraey
Contributor

MaxGraey commented Mar 18, 2025

I wonder if it would be better to statically link musl + statically link mimalloc (which should override musl's allocator). In theory it should be a little faster than dynamic mimalloc & glibc and should definitely reduce RSS.

@dschuff
Member

dschuff commented Mar 18, 2025

That's pretty cool, and at least in principle I would prefer dynamic linking of an old glibc over static linking of musl + mimalloc, for three reasons: (i) code size (even before mimalloc, the static wasm-opt x64 Linux build is >18MB), (ii) "familiarity" / fewer surprises (such as these performance problems), and (iii) more seamless integration with existing tools (e.g., one cannot run heaptrack or tcmalloc's heap profiling against the statically linked binaryen release binaries).

i) Is 18MB a lot? For the emscripten release builds we statically link libc++ and libbinaryen; wasm-opt is 19MB, clang is 131MB and the whole package is 300MB compressed and 1.2G uncompressed. We don't get too many (?) complaints about that, so I'm not sure I'd worry about code size too much.
iii) Preserving this capability would require not only dynamically linking libc, but also mimalloc itself. I think that's doable (at least on ELF) but not necessarily the recommended way to do it.

I don't think we'd need to bring all of Chromium's infrastructure in here. We could probably get away with just pulling in the sysroot. Or maybe there's an older Linux image available on GitHub than the one we are using.

Maybe the easiest thing to do first is just replace the Alpine build with something like that, and dynamically link against glibc.

@danleh
Contributor

danleh commented Mar 19, 2025

I wonder if it would be better to statically link musl + statically link mimalloc (which should override musl's allocator). In theory it should be a little faster than dynamic mimalloc & glibc and should definitely reduce RSS.

Numbers with glibc + mimalloc statically linked:

| # | config | libc | linkage | allocator | wall time | sys time | max RSS (KB) |
|---|--------|------|---------|-----------|-----------|----------|--------------|
| 7 | release build of PR #7378 with -static + -DMIMALLOC_STATIC=ON | glibc | static (not recommended) | mimalloc | 0min 30s | 9s | 1177720 |

In other words, at least for this workload and configuration (glibc, mimalloc), static linking doesn't make a difference (compare with row 6 above). As I said, building C++ code with musl on Ubuntu is a bit tedious (you have to build libstdc++ yourself from source), so I don't have numbers for it.

i) Is 18MB a lot? For the emscripten release builds we statically link libc++ and libbinaryen; wasm-opt is 19MB, clang is 131MB and the whole package is 300MB compressed and 1.2G uncompressed. We don't get too many (?) complaints about that, so I'm not sure I'd worry about code size too much.

Thanks for these ballpark numbers. Makes sense, then binary size here is not really a concern.

kripken pushed a commit that referenced this issue Mar 19, 2025
The new CMake flag MIMALLOC_STATIC controls this.

mimalloc fixes perf problems, especially on heavily multi-threaded
workloads (many functions, high core count machines) on the musl
allocator, see #5561.
@kripken
Member

kripken commented Mar 19, 2025

Fixed by #7378, which uses mimalloc in Linux static builds (including our official ones); that is the only place we are aware of issues atm.

@kripken kripken closed this as completed Mar 19, 2025
@dschuff
Member

dschuff commented Mar 19, 2025

Just for completeness, I'll mention that I actually experimented with using the Chromium sysroot instead of Alpine for the release build, but I forgot that the sysroot doesn't have a new enough version of libstdc++ for Binaryen. Chrome (and our emsdk build) solve this by building their own version of libc++ to go along with the app, and we could certainly do that here too. It's just more of a pain than what we have here.

I guess the only thing missing is the ability to dynamically override malloc in the release build for profiling, but I guess Daniel already proved that it's easy enough to just build your own if you want to do that.
