Handle entering the wasm via a table call in Asyncify #12088
Conversation
Oh this is just nasty. Aside from the very idea of encoding arguments in a string being questionable at best, this further kills the already dead performance of fiber context switches...
These numbers are roughly 3 times worse than they are on master. Somehow, a fully optimized Taisei build still barely manages to stay within 60 FPS most of the time on my Ryzen 5 3600, but there are already some framerate spikes observed with ~500+ switches per frame (which really isn't that big a number). The difference is more noticeable when fast-forwarding through a replay (essentially advancing 10 game logic frames per 1 drawn frame, increasing switches/frame tenfold). I would prefer to just keep the legacy dyncalls for Asyncify until we find a better way to resolve this that doesn't murder performance.
I guess I'm a bit confused as to why we need the two separate cases for calling exports vs. dyncalls? Can't we just interpret the …
Interesting @Akaricchi , that's definitely much worse than I expected... Ok, let's work to optimize this before landing. I pushed a commit now that avoids the string operations and stores the data using a JS object. Hopefully that's much faster, but I don't have a benchmark myself locally - is there something simple I can try?
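For context, here is a minimal sketch of the kind of change being described (illustrative only, not the actual commit; the names noteTableCall/rewindTableCall and the rewindInfo map are invented): instead of building and parsing a string identifier on every call that enters the wasm through the table, the function pointer and arguments are kept in a plain JS object keyed by a numeric id.

```js
// Illustrative sketch only, not the actual emscripten code. The idea: avoid
// per-switch string building/parsing by storing rewind data in a JS object.
var nextRewindId = 1;
var rewindInfo = {};  // id -> { funcPtr, args }

function noteTableCall(funcPtr, args) {
  // Previously this might have built something like 'tablecall <ptr> <args...>'.
  var id = nextRewindId++;
  rewindInfo[id] = { funcPtr: funcPtr, args: args };
  return id;  // pushed onto the stack of entered calls, used when rewinding
}

function rewindTableCall(id) {
  var info = rewindInfo[id];
  // wasmTable is the module's WebAssembly.Table of function pointers
  return wasmTable.get(info.funcPtr).apply(null, info.args);
}
```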
We could do that as well, but it would be more work, I think. Maybe I'm not seeing a nice way to do it, though - if you want to write up a PR that would be great, I'm not tied to this PR in any way!
Still slower than master by ~33%, but it's a huge improvement over the previous patch.
10000000 switches in 8.648807s ~1156228.777901 switches/s ~864 ns/switch
10000000 switches in 8.656814s ~1155159.387555 switches/s ~865 ns/switch
10000000 switches in 9.078460s ~1101508.350890 switches/s ~907 ns/switch
10000000 switches in 8.648500s ~1156269.864030 switches/s ~864 ns/switch
10000000 switches in 8.772179s ~1139967.676429 switches/s ~877 ns/switch
10000000 switches in 8.645461s ~1156676.288792 switches/s ~864 ns/switch
10000000 switches in 8.712480s ~1147778.754405 switches/s ~871 ns/switch
10000000 switches in 8.974158s ~1114310.631599 switches/s ~897 ns/switch
10000000 switches in 8.803610s ~1135897.609328 switches/s ~880 ns/switch
10000000 switches in 8.686977s ~1151148.466371 switches/s ~868 ns/switch
Perhaps more performance can be squeezed out of this by maintaining a list of free ident objects instead of allocating a new one every time. Ideally we should not be using arguments either, because that's another implicit allocation, but that would require knowing the call signature and specializing the makeDynCall result by arity.
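A rough illustration of the arity-specialization part of that idea (hypothetical code, not something in the tree): the wrapper returned for a given signature takes a fixed number of parameters, so the hot path never touches the arguments object.

```js
// Hypothetical sketch of specializing the table-call wrapper by arity, so the
// hot path takes fixed parameters instead of using `arguments` (which can
// force an implicit allocation). Signature strings follow the usual emscripten
// convention, e.g. 'vii' = void return, two i32 params.
function makeTableCallWrapper(sig, funcPtr) {
  var numArgs = sig.length - 1;
  switch (numArgs) {
    case 0: return function ()           { return wasmTable.get(funcPtr)(); };
    case 1: return function (a0)         { return wasmTable.get(funcPtr)(a0); };
    case 2: return function (a0, a1)     { return wasmTable.get(funcPtr)(a0, a1); };
    case 3: return function (a0, a1, a2) { return wasmTable.get(funcPtr)(a0, a1, a2); };
    // ... one case per arity actually used by the program
    default: throw new Error('unsupported arity: ' + numArgs);
  }
}
```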
You can use the microbenchmark from koishi, which is what I'm using for my measurements as well. It's dumb and not super precise, but serviceable. Unfortunately you may need to do some one-time setup if you've never built a Meson project with Emscripten before, but it's pretty straightforward:
[binaries]
c=['/path/to/emscripten/emcc']
cpp=['/path/to/emscripten/em++']
ar=['/path/to/emscripten/emar']
ranlib=['/path/to/emscripten/emranlib']
pkgconfig=['/path/to/emscripten/emconfigure', 'pkg-config']
file_packager=['/usr/bin/python3', '/path/to/emscripten/tools/file_packager.py']
[properties]
root='/data/webshit/emroot/system'
needs_exe_wrapper=true
[host_machine]
system='emscripten'
cpu_family='wasm32'
cpu='wasm32'
endian='little'
git clone https://github.com/taisei-project/koishi
cd koishi
# Replace 'emscripten' with a full path to the cross file if in non-standard location
meson setup --cross-file emscripten --buildtype release build
cd build
ninja
node koishi_bench
ninja clean && ninja
I'll see if I can find the time and energy for it, but no promises. I'm behind schedule on my personal projects quite a bit.
Sounds good, thanks @Akaricchi. I'll try to optimize this some more, but I suspect a different approach would be needed to really remove the new overhead. Specializing by arity, as you said, might help - not just with allocations, but also by helping the JIT stay monomorphic. It would have some cost in code size, though. Long-term I agree it would be good to write the arguments into memory etc., which would also be thread-safe. I don't think that's urgent though, and we should make sure it doesn't add too much complexity.
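To make that longer-term idea concrete, here is a very rough sketch; the helper names, the scratch address, and the all-doubles layout are invented purely for illustration, and a real version would respect the call signature:

```js
// Invented sketch of the "write the arguments in memory" idea: the function
// pointer, argument count, and arguments live in the wasm heap, so rewinding
// (and, eventually, other threads) can read them back without JS-side state.
// rewindScratchPtr is assumed to point at a reserved, 8-byte-aligned region.
var rewindScratchPtr = 0;

function storeRewindArgs(funcPtr, args) {
  HEAP32[rewindScratchPtr >> 2] = funcPtr;
  HEAP32[(rewindScratchPtr >> 2) + 1] = args.length;
  for (var i = 0; i < args.length; i++) {
    // store every argument in an f64 slot for simplicity
    HEAPF64[((rewindScratchPtr + 8) >> 3) + i] = args[i];
  }
}

function loadRewindArgs() {
  var funcPtr = HEAP32[rewindScratchPtr >> 2];
  var count = HEAP32[(rewindScratchPtr >> 2) + 1];
  var args = [];
  for (var i = 0; i < count; i++) {
    args.push(HEAPF64[((rewindScratchPtr + 8) >> 3) + i]);
  }
  return { funcPtr: funcPtr, args: args };
}
```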
If you're looking at nanosecond differences in performance, switching to …
This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.
Depends on #12081 (refactoring) and #12059. This removes the need
for the legacy dynCall approach for asyncify, by adding logic to note
and rewind calls that enter the wasm using the table.
Previously with dynCalls things were simpler as we only ever entered
the wasm via an export (to call a function pointer, we called an export
that did the call for us). To rewind table calls, add them to the stack
of called things.
This is not very optimized - it just serializes the function pointer and
params as a string. We should look into optimizing it, but that would
take a refactoring of how we store ids for rewinding etc., so I think
it's better to get it working first and look into optimizations later.
Fixes #12066
cc @sbc100 @Akaricchi
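For readers following along, here is a hedged sketch of what the mechanism described in this PR could look like; the helper names, stack variable, and identifier format are illustrative rather than copied from the patch. A call that enters the wasm through the table records an identifier built from the function pointer and arguments, pushes it on the stack of entered calls, and rewinding replays the same table call.

```js
// Illustrative sketch of the approach described above; only the shape of the
// idea matters, not the exact names or identifier format.
var enteredCalls = [];  // stack of things through which we entered the wasm

function enterViaTable(funcPtr, args) {
  // Unoptimized: serialize the function pointer and params into a string id.
  var id = 'tablecall ' + funcPtr + ' ' + args.join(' ');
  enteredCalls.push(id);
  try {
    return wasmTable.get(funcPtr).apply(null, args);
  } finally {
    enteredCalls.pop();
  }
}

// On rewind, the saved id is decoded and the same table call is redone.
function rewindFromId(id) {
  var parts = id.split(' ');
  var funcPtr = +parts[1];
  var args = parts.slice(2).map(Number);
  return enterViaTable(funcPtr, args);
}
```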