Skip to content

Asyncify for goroutines in WebAssembly? #1101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kripken opened this issue May 12, 2020 · 22 comments
Closed

Asyncify for goroutines in WebAssembly? #1101

kripken opened this issue May 12, 2020 · 22 comments
Labels
enhancement New feature or request wasm WebAssembly

Comments

@kripken
Copy link

kripken commented May 12, 2020

Context: @aykevl wrote:

Stackful coroutines (called goroutines). We've had to go through great pains to get these working in TinyGo and we're still hitting corner cases where it doesn't work. If it works, it's inefficient.

The discussion there on new wasm capabilities is the right long-term solution, of course, but I wonder if maybe until then, Asyncify might be useful here?

With Asyncify you run a utility on a wasm file, and get a modified wasm file that can pause and resume execution at various points. It's easy to use - you don't need to think about how it works or handle corner cases, just tell it which calls can pause, and set up a little runtime code to call things. This does add overhead, about 50% in size and speed on average in the worst cases (where it ends up instrumenting almost everything), but in the best cases it's very efficient - for example, it can avoid instrumenting inner loops where possible, letting such code run at full speed.

The idea here would be to leave many of the low-level goroutine pause/resume details to Asyncify. All TinyGo would do is have some special function call (which you don't need to implement; it's a marker for Asyncify) that indicates a pause/resume point. Asyncify is run on the wasm. Then when the wasm runs, a goroutine that reaches such a point can return to the runtime code, which can then decide which other goroutine to proceed to run.

If there's interest I'd be happy to collaborate on investigating this! (I wrote Asyncify, and have helped integrate it in various projects, but I don't know that much about TinyGo.)

@niaow
Copy link
Member

niaow commented May 12, 2020

This is unfortunately used on platforms other than WASM. I am currently working on a rebuild of the system which should work better than the current version (and probably better than asyncify).

@kripken
Copy link
Author

kripken commented May 12, 2020

I see, thanks @jaddr2line ! Sounds irrelevant then, closing...

@kripken kripken closed this as completed May 12, 2020
@niaow
Copy link
Member

niaow commented Jul 19, 2021

It probably makes more sense to switch to asyncify now. Asyncify is more mature, most other platforms have switched to stack-based goroutines. Additionally the previously mentioned rebuild of the coroutines system didnt happen.

@niaow niaow reopened this Jul 19, 2021
@aykevl
Copy link
Member

aykevl commented Jul 19, 2021

I have investigated this, and in fact looked into the feasibility of reimplementing an Asyncify-like pass in Go (to avoid a large JavaScript dependency) but there is one very big downside: it destroys debug information. This is important, see #1994 for why. Unless Asyncify learns to retain debug information, I'm afraid it is a blocker.

@kripken
Copy link
Author

kripken commented Jul 19, 2021

Binaryen's Asyncify pass does support debug information, both the Names section as well as DWARF. We don't update all parts of DWARF yet (like variable locations in wasm locals), but we do update the lines section and so things like stack traces, as discussed in #1994, should remain 100% correct.

As I mentioned when I opened this issue, I'd be very happy to help out here with Asyncify if there's interest! But I do get that adding Binaryen as a dependency has downsides. @aykevl is that what you mean by "a large JavaScript dependency"? Binaryen can also be compiled to other things, including wasm, native code, and to C. (I suspect we could also get it to compile to Go... with some effort on generalizing wasm2c.)

@aykevl
Copy link
Member

aykevl commented Jul 19, 2021

Oh wow, that is very good news! In that case, it would become a real possibility.

Yes, a Go dependency would definitely be preferred. C would also work with CGo. I would strongly wish to avoid a runtime dependency and preferably also wish to avoid a compile time dependency. However, if it's possible to drop a few C files in a package with some Go wrapper code, that would be perfect.

I haven't looked much into what would be needed. In short, for other ISAs there is a function to switch between goroutines (tinygo_swapTask) and a function that sets up the state for a first goroutine (tinygo_startTask).

I'm a bit tired right now. I'll try to better describe this soon.

@kripken
Copy link
Author

kripken commented Jul 20, 2021

Interesting! Hopefully this is useful then. Eager to hear more details of what you have in mind @aykevl

About integration, we can emit a single C file of all of Binaryen, either as an executable or a library. It doesn't run as fast as a normal build (no multithreading), but it's easy to distribute. For the initial testing though it might be easier to use a native build.

There would not be any runtime dependency. Just compile time. (However, your existing runtime would need some code to integrate with Asyncify, that is, to call unwind/rewind at the proper times etc. I assume that would be small.)

@niaow
Copy link
Member

niaow commented Jul 20, 2021

I am a bit confused, I don't think we need anything other than wasm-opt? From my understanding, the code would just hook into asyncify.start_unwind and the other intrinsics which would be rewritten by the pass?

@kripken
Copy link
Author

kripken commented Jul 20, 2021

Yes, at compile time just wasm-opt is needed. But wasm-opt is almost all of Binaryen as it includes the full optimizer. We could perhaps make a more minimalistic version, but the optimizer is fairly important in this case, as optimizing the code after the async transform helps a lot (see "the importance of optimization", the second section before last, in the asyncify post).

At runtime, just calling the intrinsics asyncify.start_unwind etc. would be necessary, correct.

@niaow
Copy link
Member

niaow commented Jul 21, 2021

This was a bit different from a stack-switching system, so I did not try to merge it with the tasks scheduler.
There are 2 stacks needed:

  1. The C stack (used to store locals which we need pointers to)
  2. The asyncify stack (used to persist WASM locals)

I was originally planning to place the asyncify stack on top of the C stack, since no C stack pushes/pops should be happening while the asyncify stack is in use. However, since these grow in opposite directions (C stack grows down, asyncify stack grows up), I instead placed these inside the same buffer such that they grow inwards.

I am not entirely sure what to do with the GC right now. Most likely, this will need to be done by modifying the stack slots system to save the linked list when we switch goroutines. I was hoping we could use this to get conservative stack scanning to work (stack slots have some bugs), but that seems like it would require an extra stack to run the GC.

The WASM assembly to handle the unwinding and rewinding seems a bit complicated. Right now I copied the direct-call-to-start technique used by the coroutines scheduler because it was difficult to merge the initial-start and resume cases together correctly.

@aykevl
Copy link
Member

aykevl commented Jul 21, 2021

@niaow thank you for looking into this!

Regarding how to include this in TinyGo, I was referring to the TinyGo build (not using TinyGo to build wasm binaries). If possible, I would very much prefer if users won't need to install extra tools to use TinyGo. Requiring extra tools causes way too many headaches, see avr-gcc for example. For WebAssembly, TinyGo currently does not require any external tools in the release build: clang and wasm-ld are both built into the binary directly. For developer builds that use go install, these two binaries need to be present however (but are often versioned so won't cause version mismatches on Linux). Needing wasm-opt present in the $PATH variable would be unfortunate but usable for go install builds, but I would very much prefer if users didn't need to install wasm-opt when using a release build of TinyGo.

@kripken From a quick look I see that Binaryen is a pretty big C++ program. Is there a way to link to it as a library, via C bindings? That would make it much easier to link it statically into the TinyGo binary for releases.

@aykevl is that what you mean by "a large JavaScript dependency"?

Oops, I assumed Binaryen was written in JavaScript. Clearly that's not the case, which makes distribution a lot easier.

@niaow I see that you are looking into Asyncify for TinyGo, which is great! However, I wonder if it wouldn't be easier to first sort out how to add Binaryen as a TinyGo dependency? For example, I think the optimizer might be useful in and of itself to reduce code size and possibly improve speed (although I haven't tested this) - right now we're just using LLVM. It would certainly be a lot simpler to review if these two things are split.

@kripken
Copy link
Author

kripken commented Jul 21, 2021

@aykevl

@kripken From a quick look I see that Binaryen is a pretty big C++ program. Is there a way to link to it as a library, via C bindings? That would make it much easier to link it statically into the TinyGo binary for releases.

Yes, Binaryen has a C API and can be built as a library (that's the default actually in CMake).

@niaow
Copy link
Member

niaow commented Jul 21, 2021

Yeah these two things can be split, I was mostly trying to just see how to use it inside of TinyGo's runtime.

And yeah, the C API doesn't seem too bad.
We will need to fork though to run tools.
I think that we can statically link it?
I can take a look tommorow.

@aykevl
Copy link
Member

aykevl commented Jul 21, 2021

Yes, Binaryen has a C API and can be built as a library (that's the default actually in CMake).

Awesome!

Yeah these two things can be split, I was mostly trying to just see how to use it inside of TinyGo's runtime.

I understand.

Oh and one final note:

I am not entirely sure what to do with the GC right now. Most likely, this will need to be done by modifying the stack slots system to save the linked list when we switch goroutines. I was hoping we could use this to get conservative stack scanning to work (stack slots have some bugs), but that seems like it would require an extra stack to run the GC.

I was actually thinking the same thing: asyncify and GC stack scanning both need to have some sort of stack access, so they might work well together. My initial thought of how this could work is as follows:

  • Functions which (directly or indirectly) may call runtime.GC need to save their locals to the C stack when they call one of these functions.
  • Functions which (directly or indirectly) may call a stack switching function need to save their locals to the C stack when they call one of these functions (just like before) but also need to have all the usual branches and stuff that asyncify does.

So essentially the first case is a subset of the second case. The difference here with current Asyncify is that it saves everything to the C stack instead of to a custom buffer, which would make it much more similar to other instruction sets and thus easier to integrate in TinyGo (and perhaps other tools). It's very well possible that I've overlooked some major issues, but if I were to design Asyncify I would have first tried a design like this.
@kripken I'm curious what you think of this. Is this a reasonable design, did I miss something? Why did you choose the custom buffer instead of re-using the C stack for this purpose?

@kripken
Copy link
Author

kripken commented Jul 21, 2021

@aykevl

Asyncify doesn't use the C stack directly mainly since the C stack isn't part of wasm, it's just a toolchain convention, and some languages or implementations might not have such a thing. So Asyncify just gets a pointer to a buffer for it to use. But you could allocate that buffer on your C stack if that makes sense. One downside to doing so is that then you'd be copying the Asyncify buffer when you copy the C stack (which I assume you need to do with multiple executions in flight).

Overall what you said sounds right to me. In both GC and stack switching the wasm locals need to reach linear memory so they can be scanned.

One option is to have two mechanisms, one for GC, and for stack switching. They would write to different places (wherever the GC one writes to, and Asyncify will write to the buffer you point it to), and there would be some extra overhead due to spilling some locals twice. But that might be fine, and seems simple.

Another option might be to only GC when nothing is running. That is, right before a GC, pause the current execution. Then all the current executions will have been serialized into Asyncify buffers, and you can scan those (conservatively), then GC, then resume execution. (This btw is what emscripten_scan_registers allows in Emscripten - it pauses uses Asyncify and provides a pointer to the serialized buffer.) A downside is that Asyncify will need to instrument more code, and also the work to pause and resume for each GC. But this would be very simple since you wouldn't need a specific mechanism for GC spilling at all.

@aykevl
Copy link
Member

aykevl commented Jul 24, 2021

Asyncify doesn't use the C stack directly mainly since the C stack isn't part of wasm, it's just a toolchain convention, and some languages or implementations might not have such a thing.

Are you saying that Asyncify ignores the C stack pointer entirely? That seems unsafe to me, any program that would use Asyncify and also uses the C stack pointer (which would probably also include Rust and definitely includes TinyGo) would corrupt the C stack when unwinding/rewinding the stack. Maybe not when using it to implement suspending, but certainly when running multiple async tasks at the same time (unwinding one, rewinding the other). Take a look at this innocent-looking piece of code for example: https://godbolt.org/z/qGaczasj4

(This btw is what emscripten_scan_registers allows in Emscripten - it pauses uses Asyncify and provides a pointer to the serialized buffer.)

Wow, that might actually work for the TinyGo GC!

@niaow
Copy link
Member

niaow commented Jul 24, 2021

would corrupt the C stack when unwinding/rewinding the stack

So there are 2 things called asyncify: a transformation pass and a library. The asyncify pass does not care about the C __stack_pointer. Instead, the stack pointer is saved by the function initiating the unwind operation. For example in my implementation of the runtime for tinygo:

https://github.com/niaow/tinygo/blob/a4e95f985a7d9467bfee109ed5a67ffe39eed1c9/src/internal/task/task_asyncify_wasm.S#L18-L25

@niaow
Copy link
Member

niaow commented Jul 24, 2021

Additionally for GC:

The main thing I am concerned about is GC when outside of a goroutine. During the entry point or when handling an event, we spawn goroutines (which allocate memory and can cause a GC). In order for this effectively-system-stack to be scanned properly, we would need to wrap it in a function which can be rewound to, and provide a buffer to rewind into. At the moment it looks like everything should continue functioning if the scheduler and resume() are not scanned, although I don't know whether or not this is something we should maintain in the future. If we are not scanning the scheduler itself, we might want to consider rewinding to the scheduler to run the GC cycle rather than adding an extra special rewind path?

@kripken
Copy link
Author

kripken commented Jul 24, 2021

@aykevl

Are you saying that Asyncify ignores the C stack pointer entirely? That seems unsafe to me, any program that would use Asyncify and also uses the C stack pointer (which would probably also include Rust and definitely includes TinyGo) would corrupt the C stack when unwinding/rewinding the stack.

It is unsafe to ignore the C stack, definitely, for C++, C, Rust, etc. But Asyncify leaves that to the runtime - Asyncify just handles pausing and resuming the wasm stack. A runtime can use Asyncify and in a modular way also handle the C stack if it has one, and other things, on top. Basically like @niaow 's link - nice!

(In theory we could add an option for Asyncify to also handle the userspace stack, but then we'd need to be told how that stack works: which global it uses, does it go up or down, what are the size limits on it, is it contiguous or not, etc. - not everything uses the current LLVM conventions for the userspace stack.)

@aykevl
Copy link
Member

aykevl commented Jul 25, 2021

It is unsafe to ignore the C stack, definitely, for C++, C, Rust, etc. But Asyncify leaves that to the runtime - Asyncify just handles pausing and resuming the wasm stack. A runtime can use Asyncify and in a modular way also handle the C stack if it has one, and other things, on top.

Ah, I see, that makes sense. I didn't get this from the Asyncify documentation. I might have missed this, but if I haven't, it might be worth adding to the docs (especially as it talks a lot about C++ which uses the C stack).

@kripken
Copy link
Author

kripken commented Jul 26, 2021

@aykevl

Good point, thanks, I should update the docs. Did you mean these, or these, or someplace else that you were looking at?

@deadprogram deadprogram added enhancement New feature or request wasm WebAssembly labels Sep 5, 2021
@deadprogram deadprogram added the next-release Will be part of next release label Nov 14, 2021
@deadprogram
Copy link
Member

I think the request was fulfilled in the new release, so closing. Please reopen if needed. Thanks!

@deadprogram deadprogram removed the next-release Will be part of next release label Nov 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wasm WebAssembly
Projects
None yet
Development

No branches or pull requests

4 participants