Handle entering the wasm via a table call in Asyncify #12088

Closed · kripken wants to merge 8 commits into master from async2

Conversation

kripken (Member) commented Aug 31, 2020

Depends on #12081 (refactoring) and #12059. This removes the need
for the legacy dynCall approach in Asyncify by adding logic to record
and rewind calls that enter the wasm via the table.

Previously, with dynCalls, things were simpler, as we only ever entered
the wasm via an export (to call a function pointer, we called an export
that did the call for us). To rewind table calls, we now add them to the
stack of called things.

This is not very optimized - it just serializes the function pointer and
params as a string. We should look into optimizing it, but that would
require refactoring how we store ids for rewinding etc., so I think
it's better to get it working first and look into optimizations later.
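
Concretely, the bookkeeping amounts to something like the following (a simplified, hypothetical JavaScript sketch of the approach described above; the names and the exact encoding are illustrative, not the actual patch):

// On entry via the table, encode the function pointer and its arguments
// as a string id, e.g. "5|1,2". The id itself is the rewind record.
function recordTableCall(funcPtr, args) {
  return funcPtr + '|' + Array.prototype.join.call(args, ',');
}

// On rewind, decode the id and repeat the exact same table call so that
// Asyncify can rewind through it again. (Assumes numeric arguments, as
// wasm values arrive in JS; i64/BigInt would need extra handling.)
function rewindTableCall(id) {
  var parts = id.split('|');
  var funcPtr = +parts[0];
  var args = parts[1] ? parts[1].split(',').map(Number) : [];
  return wasmTable.get(funcPtr).apply(null, args);
}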

Fixes #12066

cc @sbc100 @Akaricchi

Base automatically changed from async1 to master August 31, 2020 17:16

Akaricchi (Contributor) commented Sep 1, 2020

Oh this is just nasty. Aside from the very idea of encoding arguments in a string being questionable at best, this further kills the already dead performance of fiber context switches...

10000000 switches in 18.223394s ~548745.198463 switches/s       ~1822 ns/switch
10000000 switches in 18.408631s ~543223.457685 switches/s       ~1840 ns/switch
10000000 switches in 18.239809s ~548251.362679 switches/s       ~1823 ns/switch
10000000 switches in 18.104195s ~552358.180668 switches/s       ~1810 ns/switch
10000000 switches in 19.524142s ~512186.406009 switches/s       ~1952 ns/switch
10000000 switches in 18.592954s ~537838.138595 switches/s       ~1859 ns/switch
10000000 switches in 17.843040s ~560442.609412 switches/s       ~1784 ns/switch
10000000 switches in 18.867718s ~530005.790910 switches/s       ~1886 ns/switch
10000000 switches in 17.893386s ~558865.709390 switches/s       ~1789 ns/switch
10000000 switches in 18.317863s ~545915.217977 switches/s       ~1831 ns/switch

These numbers are roughly 3 times worse than they are on master.

Somehow, a fully optimized Taisei build still barely manages to stay within 60 FPS most of the time on my Ryzen 5 3600, but I'm already seeing some framerate spikes with ~500+ switches per frame (which really isn't that big a number). The difference is more noticeable when fast-forwarding through a replay (essentially advancing 10 game logic frames per 1 drawn frame, which increases switches/frame tenfold).

I would prefer to just keep the legacy dyncalls for Asyncify until we find a better way to resolve this that doesn't murder performance.

Akaricchi (Contributor) commented Sep 1, 2020

I guess I'm a bit confused as to why we need the two separate cases for calling exports vs. dyncalls. Can't we just interpret the rewind_id as a function pointer (which is an index into the wasmTable, I assume) and store the arguments in the Asyncify data struct somewhere? This would fully encode the continuation in linear memory with no JS-side state, possibly allowing a suspended fiber to be resumed on a different thread.
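
In other words, something like this hypothetical sketch (the struct layout, offsets, and helper names are invented for illustration; the real Asyncify data struct is laid out differently):

// Hypothetical: persist the whole continuation in linear memory.
// Imagined layout (8-byte aligned): [funcPtr: i32][argc: i32][args: f64 x argc]
function saveContinuation(dataPtr, funcPtr, args) {
  HEAP32[dataPtr >> 2] = funcPtr;               // rewind_id == table index
  HEAP32[(dataPtr + 4) >> 2] = args.length;
  for (var i = 0; i < args.length; i++) {
    HEAPF64[((dataPtr + 8) >> 3) + i] = args[i];
  }
}

function resumeContinuation(dataPtr) {
  var funcPtr = HEAP32[dataPtr >> 2];
  var argc = HEAP32[(dataPtr + 4) >> 2];
  var args = [];
  for (var i = 0; i < argc; i++) {
    args.push(HEAPF64[((dataPtr + 8) >> 3) + i]);
  }
  // No JS-side state: any thread sharing this memory could resume here.
  return wasmTable.get(funcPtr).apply(null, args);
}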

kripken (Member, Author) commented Sep 1, 2020

Interesting @Akaricchi , that's definitely much worse than I expected... Ok, let's work to optimize this before landing.

I pushed a commit now that avoids the string operations and stores the data using a JS object. Hopefully that's much faster, but I don't have a benchmark myself locally - is there something simple I can try?
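
The change amounts to something like this (an illustrative sketch, not the exact commit; the names are invented):

// Illustrative: instead of building and parsing strings, keep the pointer
// and arguments in a plain JS object keyed by a numeric rewind id.
var tableCalls = {};
var nextTableCallId = 1;

function recordTableCall(funcPtr, args) {
  var id = nextTableCallId++;
  tableCalls[id] = { funcPtr: funcPtr, args: args };
  return id;
}

function rewindTableCall(id) {
  var call = tableCalls[id];
  delete tableCalls[id];
  return wasmTable.get(call.funcPtr).apply(null, call.args);
}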

> I guess I'm a bit confused as to why we need the two separate cases for calling exports vs. dyncalls. Can't we just interpret the rewind_id as a function pointer (which is an index into the wasmTable, I assume) and store the arguments in the Asyncify data struct somewhere? This would fully encode the continuation in linear memory with no JS-side state, possibly allowing a suspended fiber to be resumed on a different thread.

We could do that as well, but it would be more work, I think. Maybe I'm not seeing a nice way to do it, though - if you want to write up a PR that would be great, I'm not tied to this PR in any way!

Akaricchi (Contributor) left a comment

Still slower than master by ~33%, but it's a huge improvement over the previous patch.

10000000 switches in 8.648807s  ~1156228.777901 switches/s      ~864 ns/switch
10000000 switches in 8.656814s  ~1155159.387555 switches/s      ~865 ns/switch
10000000 switches in 9.078460s  ~1101508.350890 switches/s      ~907 ns/switch
10000000 switches in 8.648500s  ~1156269.864030 switches/s      ~864 ns/switch
10000000 switches in 8.772179s  ~1139967.676429 switches/s      ~877 ns/switch
10000000 switches in 8.645461s  ~1156676.288792 switches/s      ~864 ns/switch
10000000 switches in 8.712480s  ~1147778.754405 switches/s      ~871 ns/switch
10000000 switches in 8.974158s  ~1114310.631599 switches/s      ~897 ns/switch
10000000 switches in 8.803610s  ~1135897.609328 switches/s      ~880 ns/switch
10000000 switches in 8.686977s  ~1151148.466371 switches/s      ~868 ns/switch

Perhaps more performance can be squeezed out of this by maintaining a list of free ident objects instead of allocating a new one every time. Ideally we should not be using arguments either because that's another implicit allocation, but that would require knowing the call signature and specializing the makeDynCall result by arity.
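
For example (a hypothetical sketch of both ideas; all names are invented):

// Recycle bookkeeping objects through a free list, and emit one wrapper
// per arity so the `arguments` object is never materialized and call
// sites stay monomorphic for the JIT.
var identPool = [];

function acquireIdent(funcPtr, a0, a1) {
  var ident = identPool.pop() || { funcPtr: 0, a0: 0, a1: 0 };
  ident.funcPtr = funcPtr;
  ident.a0 = a0;
  ident.a1 = a1;
  return ident;
}

function releaseIdent(ident) {
  identPool.push(ident); // recycled instead of becoming garbage
}

// What a makeDynCall specialized for two-argument signatures might emit.
// (Ignores unwind/rewind subtleties: real code must keep the ident alive
// until the rewound call finally completes.)
function dynCallArity2(funcPtr, a0, a1) {
  var ident = acquireIdent(funcPtr, a0, a1);
  var result = wasmTable.get(funcPtr)(a0, a1);
  releaseIdent(ident);
  return result;
}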

Akaricchi (Contributor) commented

> Hopefully that's much faster, but I don't have a benchmark myself locally - is there something simple I can try?

You can use the microbenchmark from koishi, which is what I'm using for my measurements as well. It's dumb and not super precise, but serviceable.

Unfortunately you may need to do some one-time setup if you've never built a Meson project with Emscripten before, but it's pretty straightforward:

  • Create a file with the following contents, changing /path/to/emscripten as needed, and save it as ~/.local/share/meson/cross/emscripten (or somewhere else if you're not on Linux; see the Meson documentation on cross files for more information):
[binaries]
c=['/path/to/emscripten/emcc']
cpp=['/path/to/emscripten/em++']
ar=['/path/to/emscripten/emar']
ranlib=['/path/to/emscripten/emranlib']
pkgconfig=['/path/to/emscripten/emconfigure', 'pkg-config']
file_packager=['/usr/bin/python3', '/path/to/emscripten/tools/file_packager.py']

[properties]
root='/data/webshit/emroot/system'
needs_exe_wrapper=true

[host_machine]
system='emscripten'
cpu_family='wasm32'
cpu='wasm32'
endian='little'
  • Clone koishi and set up a build directory:
git clone https://github.com/taisei-project/koishi
cd koishi
# Replace 'emscripten' with a full path to the cross file if in non-standard location
meson setup --cross-file emscripten --buildtype release build
  • Build and run the benchmark:
cd build
ninja
node koishi_bench
  • To rebuild after an emscripten change:
ninja clean && ninja

Akaricchi (Contributor) commented

> We could do that as well, but it would be more work, I think. Maybe I'm not seeing a nice way to do it, though - if you want to write up a PR that would be great, I'm not tied to this PR in any way!

I'll see if I can find the time and energy for it, but no promises. I'm quite a bit behind schedule on my personal projects.

kripken (Member, Author) commented Sep 1, 2020

Sounds good, thanks @Akaricchi

I'll try to optimize this some more, but I suspect a different approach would be needed to really remove the new overhead. Specializing by arity as you suggested might help - not just for allocations, but also for keeping the JIT monomorphic. That would have some cost in code size, though.

Long-term I agree it would be good to write the arguments into memory etc., which would also be thread-safe. I don't think that's urgent, though, and we should make sure it doesn't add too much complexity.

curiousdannii (Contributor) commented

If you're looking at nanosecond differences in performance, switching to a Map instead of an object might help. Maps can be up to twice as fast for reads. But it's only a small part of the whole cost.
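
For instance (illustrative; the keys and values are made up):

// Numeric keys on a plain object are coerced to strings, while a Map
// keeps them as numbers and skips the object property lookup machinery,
// which can make hot reads noticeably faster in some engines.
var callsObj = {};
callsObj[42] = { funcPtr: 5 };     // the key becomes the string "42"

var callsMap = new Map();
callsMap.set(42, { funcPtr: 5 });  // the key stays the number 42
var call = callsMap.get(42);
callsMap.delete(42);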

Base automatically changed from master to main March 8, 2021 23:49

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

stale bot added the wontfix label Apr 16, 2022
stale bot closed this Jul 31, 2022
sbc100 deleted the async2 branch March 7, 2023 19:44

Successfully merging this pull request may close these issues.

test_fibers_asyncify fails without the binaryen generated dynCall functions.
3 participants