[Wasm GC] Add a GC-Lowering pass which lowers GC to MVP #4000

Open · kripken wants to merge 198 commits into base: main
Conversation

kripken (Member) commented Jul 17, 2021

This converts, e.g., a struct.set into an appropriate write to linear memory.
The hard part is implementing things like RTT semantics, casts, and subtyping,
which requires adding a runtime.

The layout of things in linear memory is pretty simple, and outlined at the
beginning of the pass code in a comment.
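For illustration, the lowering of a struct field access can be sketched outside of Binaryen as plain pointer arithmetic on a byte array standing in for linear memory. The header size and field offsets below are made-up values for a struct with two i64 fields, not the pass's actual layout:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Linear memory stand-in.
std::vector<uint8_t> memory(1024);

// Made-up layout: a header (say, a kind tag plus an RTT pointer) followed
// by the struct's fields, here two i64s.
constexpr uint32_t kHeaderSize = 8;
constexpr uint32_t kFieldOffset[] = {0, 8};

// struct.set $t i (local.get $ref) (i64.const v)  ~>  an i64.store at
// ref + header + offset(i).
void structSetI64(uint32_t ref, uint32_t field, int64_t value) {
  std::memcpy(&memory[ref + kHeaderSize + kFieldOffset[field]], &value,
              sizeof(value));
}

// struct.get lowers to the matching i64.load.
int64_t structGetI64(uint32_t ref, uint32_t field) {
  int64_t value;
  std::memcpy(&value, &memory[ref + kHeaderSize + kFieldOffset[field]],
              sizeof(value));
  return value;
}
```

A struct.new then becomes a call into the runtime's allocator for header plus field sizes, followed by stores of the initial values.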

This is useful for performance evaluations of wasm GC (since it allows the wasm
to be compiled through LLVM, for example) as well as for functioning as a polyfill
for wasm GC. The latter would almost allow a language to use wasm GC today and
compile down to MVP wasm until VMs implement it; the one missing piece is that the
collector does not yet do actual collection (which could be added later if there is
interest).

Tested on the existing wasm GC benchmarks from Dart. Not fuzzed yet.

Apologies for the size of this PR, but it's the minimal amount of code that
is actually usable and testable...

@kripken kripken requested review from tlively and aheejin July 17, 2021 00:09
kripken (Member, Author) commented Jul 19, 2021

@tlively It looks like the --help lit tests don't auto-update. Is that intentional? I see they are .test and not .wast, so I assume there is a reason?

tlively (Member) commented Jul 19, 2021

Right, we would need a separate update script to auto-update those tests because their output is not Wasm. Writing a script that maintains the current factoring of shared options into different files might get complicated, so I thought that maintaining the tests by hand would be fine for now.

tlively (Member) left a comment

Not nearly finished reviewing, but here are some early comments.

// | ptr* | List of types. Each is a pointer to the rtt.canon for the |
// | | type. In an rtt.canon, this points to the object itself, |
// | | that is, we will have ptr => [kind, 1, ptr]. An rtt.sub |
// | | copies the list of the parent, and appends the new type at |
tlively (Member):
IIUC, rtt.canon does not contain any supertypes, so wouldn't this list be empty rather than point to itself?

kripken (Member, Author):
You're right, I'll clarify the comment. RttSupers only contains the parents, but here we must contain the full list, as we have no extra field to store the new type like Literal has (the "type" field there). So the list here is never empty.

Comment on lines 86 to 90
  Expression*
  makeSimpleSignedLoad(Expression* ptr, Type type, Address offset = 0) {
    auto size = type.getByteSize();
    return makeLoad(size, true, offset, size, ptr, type);
  }
tlively (Member):
It looks like this is never used. Could we remove it and rename makeSimpleUnsignedLoad to makeSimpleLoad? I don't think the sign matters anyway if the load is the full width of the resulting data.

kripken (Member, Author):
Good point, done.
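To see why the sign is irrelevant for a full-width load: sign- and zero-extension only differ when the loaded width is narrower than the destination, as in this small standalone example:

```cpp
#include <cstdint>

// An 8-bit load into an i32 must choose an extension, and the choice
// matters for values with the high bit set.
int32_t load8Signed(uint8_t byte) {
  return static_cast<int8_t>(byte);  // sign-extend
}
int32_t load8Unsigned(uint8_t byte) {
  return byte;                       // zero-extend
}

// A full-width 32-bit load has no bits to extend, so there is only one
// possible result: the loaded bit pattern itself.
int32_t load32(uint32_t bits) { return static_cast<int32_t>(bits); }
```

So makeSimpleLoad, which always loads the full byte size of the type, can pick either signedness without changing behavior.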

}

// Make a constant for a pointer value. This handles wasm32/64 differences.
Expression* makePointerConst(Address addr) {
tlively (Member):
Should these be methods on the standard Builder class? I imagine that as we do more with 64-bit memories, this kind of thing will become more common.

kripken (Member, Author):
Maybe. I'd suggest leaving them here for now, and adding a pointer-builder.h header eventually when we find the need to use them elsewhere.
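A minimal sketch of the wasm32/64 dispatch such a helper performs (the struct and names below are stand-ins, not Binaryen's Builder or Literal API): a pointer constant becomes an i32.const under wasm32 and an i64.const under wasm64:

```cpp
#include <cstdint>
#include <string>

// Stand-in for the constant expression the builder would emit.
struct PointerConst {
  bool is64;
  uint64_t value;
};

PointerConst makePointerConst(uint64_t addr, bool memory64) {
  if (!memory64) {
    // Under wasm32, addresses are 32-bit; keep only the low bits.
    return {false, addr & 0xffffffffull};
  }
  return {true, addr};
}

// Render as wat-style text, for inspection.
std::string render(const PointerConst& c) {
  return (c.is64 ? std::string("(i64.const ") : std::string("(i32.const ")) +
         std::to_string(c.value) + ")";
}
```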

}

// Null-check a pointer.
Expression* makePointerNullCheck(Expression* a) {
tlively (Member):
Maybe makePointerIsNull to clarify that it will return true if the pointer is null?

kripken (Member, Author):
Good point, it's ambiguous as it is. Done.


    // Record the original types of things, which may be needed later.
    if (type.isRef() || type.isRtt()) {
      originalTypes[getCurrentPointer()] = type.getHeapType();
tlively (Member):
Is getCurrentPointer() the same as curr?

kripken (Member, Author):
No, getCurrent() gets curr, effectively, while getCurrentPointer gets the pointer to curr.
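The distinction can be sketched as follows (illustrative types, not Binaryen's actual walker): the pointer-to-slot is what lets a pass replace the current expression within its parent in place:

```cpp
struct Expression {
  int id;
};

// Minimal walker stand-in: currp points at the slot in the parent node
// that holds the expression currently being visited.
struct Walker {
  Expression** currp;

  Expression* getCurrent() { return *currp; }        // the node itself
  Expression** getCurrentPointer() { return currp; } // the slot holding it
  void replaceCurrent(Expression* e) { *currp = e; } // swap in a new node
};
```

Using the pointer as a map key (as originalTypes does above) therefore identifies the position in the tree, which stays valid even if the expression in that slot is later replaced.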

bashor commented Oct 28, 2022

@kripken could you please rebase it to the main branch?

kripken (Member, Author) commented Oct 28, 2022

@bashor That would be a large effort, I'm afraid, as the spec has changed a lot in the last 1.5 years. The spec is still changing, also, so I think it would be best to wait for it to fully stabilize to not do a lot of redundant work here.

Also, note that this does not actually perform GC - it puts data in linear memory but does not collect, nor does it have logic to scan roots. Those could be added but it is another large chunk of work. (Though, if we wait then wasm may add a feature for root scanning at least.)

Do you have an urgent need for this?

bashor commented Oct 31, 2022

> The spec is still changing, also, so I think it would be best to wait for it to fully stabilize to not do a lot of redundant work here.

It feels to me like it's already stable enough.

> Also, note that this does not actually perform GC - it puts data in linear memory but does not collect, nor does it have logic to scan roots.

It's already something to start with :)

> Do you have an urgent need for this?

Well, we consider ways to try Kotlin/Wasm outside of browsers and it seems like the simplest way for now.

mraleph commented Nov 8, 2022

Just leaving a note here that we are similarly interested in this from the Dart2Wasm side. It's not pressing, but it might be an interesting fallback for uses outside of the browser.

kripken (Member, Author) commented Nov 8, 2022

Good to know this would be useful. From my side, I intend to get to it once the spec and binaryen's implementation of it are stable (to avoid wasted work), and once performance is in a better place (which is what I'm focused on now).

Note, though, that getting this to a fully production-ready state would require tracking locals on the stack and adding a mark-sweep implementation. But leaking memory (as the PR does now) should be enough to unblock experiments in this space.

dcodeIO (Contributor) commented Nov 11, 2022

Perhaps as a data point for a potential minimal integration story, here's what I could imagine: when building with GC lowering, instead of compiling into a fresh module, start from an existing module containing the "runtime". The runtime provides the necessary integration points for the lowering pass:

  • An alloc(size, id) function that is called with
    • an array's or struct's byte size
    • a unique id of the heap type
  • A getid(ptr) function to obtain the unique id (given to alloc) of an object according to the runtime's memory layout of GC objects.
  • A link(parentPtr, childPtr) function that is called when a parent-child relationship is established (say: parent.foo = child, where both parent and child are GC-typed), serving as an integration point for a GC that utilizes a write barrier.
  • A visit(ptr) function for visiting (marking) a GC object.
  • A heap_base global (for the pass to amend). Its original value marks where the lowering pass puts the static information it requires; the pass then amends the global to point past that data, which becomes the new heap_base.
  • A stack_size global or pass argument, for a shadow stack region. Could either place the stack at the start of linear memory by convention, in a second memory, or insert at heap_base and amend again.

The pass would then additionally generate:

  • Spilling of GC-typed locals (pointers) to the shadow stack.
  • A visitGlobals function calling the runtime's provided visit for each GC-typed global.
  • A visitStack function calling the runtime's provided visit for each live shadow stack item. Can perhaps be merged with visitGlobals to become visitRoots.
  • A visitObject function switching over every possible heap type (by unique id) that is aware of each struct's GC-typed fields (respectively, each array's element type) and their offsets relative to the struct's or array's address after lowering, calling the runtime's provided visit function for the object and for each GC-typed field or element with their respective pointers.

With this in place, an incremental GC could step when, say, alloc is called. Allocations are under the control of the linked runtime, starting at heap_base. Marking starts with visitGlobals and visitStack and the runtime can traverse from there by (incrementally) calling visitObject on what it finds to be reachable. Sweeping can then free any object that was alloced but didn't get touched by visit. Pinning, if necessary, can be implemented on the runtime side.
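As a rough illustration of how a runtime built on hooks like these could collect (all names and types below are hypothetical; in the real design the generated visitRoots/visitObject functions would drive the traversal): mark everything reachable from the roots, then sweep anything allocated but unmarked:

```cpp
#include <iterator>
#include <set>
#include <vector>

using Ptr = unsigned;

struct Heap {
  std::set<Ptr> allocated;  // everything handed out by alloc
  std::set<Ptr> marked;
  // Outgoing GC-typed references per object; in the real design the
  // generated visitObject would enumerate these from each type's layout.
  std::vector<std::vector<Ptr>> edges;

  void collect(const std::vector<Ptr>& roots) {
    // Mark: start from what visitGlobals/visitStack would report.
    marked.clear();
    std::vector<Ptr> work(roots);
    while (!work.empty()) {
      Ptr p = work.back();
      work.pop_back();
      if (!marked.insert(p).second) {
        continue;  // already visited
      }
      for (Ptr child : edges[p]) {  // visitObject(p)
        work.push_back(child);
      }
    }
    // Sweep: free anything allocated but never marked.
    for (auto it = allocated.begin(); it != allocated.end();) {
      it = marked.count(*it) ? std::next(it) : allocated.erase(it);
    }
  }
};
```

An incremental collector would interleave the mark loop with calls to alloc rather than running it to completion, using the link write barrier to stay correct while the mutator runs.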

This is off the top of my head while looking at the GC we use, but perhaps it's already useful :)

kripken (Member, Author) commented Nov 11, 2022

Thanks for the feedback @dcodeIO !

It does seem like if we want to integrate with incremental GC and write barriers then we'd need a fairly comprehensive "runtime" layer like that. Maybe it makes sense to do.

I was hoping the runtime could be simpler, though. My hope is that wasm GC would be the fast version, while the lowered MVP version could be slower, since it's a fallback while VMs work to implement wasm GC (which is hopefully not for long). Given that, I was hoping incremental GC, write barriers, etc. would not be needed in the MVP version. The runtime would then include something like alloc/free for getting space for GC objects, but it would leave mark/sweep to LowerGC (which would not be super-efficient). But those are just some general thoughts; I don't have a full design in mind.

tlively (Member) commented Nov 11, 2022

If we had wasm-merge functionality, we could merge in arbitrary runtime modules provided by us or provided by users with more specific needs. cc @ashleynh

dcodeIO (Contributor) commented Nov 12, 2022

I made such a runtime (a variant with language-provided alloc/free) to get an initial idea. It's 3 KB, has no memory or table of its own, and is incremental-capable; the start function can probably also be refactored away at the cost of a branch. The MM is a variant of TLSF, and the GC is tri-color mark-and-sweep. I haven't tested it, though, so it may or may not be functional already (what's called alloc above is __new here). Perhaps that helps to judge complexity :)

kripken (Member, Author) commented Nov 28, 2022

This came up in an offline discussion today. Thinking about speed, it seems that even a simple wasm VM implementation of GC could be much faster than this polyfill (due to things like scanning the stack, etc.). Given that, it seems like the polyfill would only help for cases where speed doesn't matter too much.

In the discussion I joked that we could compile a wasm GC VM to wasm to run GC on VMs without GC. But maybe that actually makes some sense? If we don't care about speed, and just want a way to run the code, then that could be easy and good enough. This could use spidermonkey.wasm or wasm3 or something else.

mraleph commented Nov 28, 2022

> Thinking about speed, it seems that even a simple wasm VM implementation of GC could be much faster than this polyfill (due to things like scanning the stack, etc.).

I think predicting actual speed is hard here. A real GC also has to scan the stack, and compiled Wasm+GC code similarly has to spill live values to the stack across calls, so it's unclear to me whether the difference is going to be all that bad.

On the other hand, the lowered wasm can probably skip some of the checks that Wasm+GC needs to do to satisfy the type system.

kripken (Member, Author) commented Nov 28, 2022

@mraleph Ah, good point that the lowering can be a little unsafe where it makes sense. That cuts the other way and could make it potentially faster.

CountBleck added a commit to CountBleck/binaryen that referenced this pull request Aug 10, 2023
Apparently WebAssembly#4000 did the exact same thing with the
exact same name.