Languages where strings are primarily UTF-8 #62
Yes, I think that Rust will continue to use its own linear memory for the foreseeable future. The benefit of this proposal (with regards to Rust) is at the boundaries: you can compile Rust code to Wasm, and that Wasm module can export stringrefs. Similarly, if the Rust code imports another Wasm module, it can easily convert an incoming stringref into a Rust String. Similarly, it makes it a lot easier for Rust code to send strings to / from JS, because a JS string can be a stringref. Performance-wise it should be similar to what we have today, but it improves interop and composability.
I'm not sure that's any better than the current state of interop of Rust with other code on the Web today. Today you already can create a JS string from a Rust string and use it on the boundary as an externref. Outside of the Web, you need some larger interop story to handle more data types than strings, such as records/arrays/etc. If you're using the component-model, that allows you to pass strings between components without stringref.
That is already covered in the overview. Performance-wise, that always forces 2 conversions to / from a JS string, with full transcoding each time.

For example, imagine two Wasm modules, both of which were compiled from Rust. Because both modules were compiled from Rust, they both internally use UTF-8 Strings. When linking those two Wasm modules together, the Wasm compiler can notice that fact and then optimize it so that it just copies the raw bytes directly from one linear memory into the other linear memory, without any transcoding. That means it just needs to do 1 very fast memcpy, instead of allocating a JS string and doing 2 transcodes.

Consider this other example: imagine two Wasm modules. One of those Wasm modules is compiled from Rust, and the other Wasm module is compiled from a UTF-16 language (like C#). With your approach, it would need to heap-allocate a JS string, do an extra copy, and transcode twice. However, with stringref the engine only needs to transcode once, between the two representations.

Consider this third example: imagine a Wasm module is compiled from Rust. That Wasm module calls host APIs with strings (like browser DOM APIs). With stringref, the host can accept the string directly, without going through JS glue code.

Those sorts of optimizations can't be done with JS strings passed around as externref.

Also, this proposal is not only for UTF-8 languages, it's designed to accommodate many languages. Languages which rely on specific memory layouts (like Rust or C++) won't benefit as much, and that's okay.
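To make the cost difference concrete, here is an illustrative Rust sketch (not from the proposal): routing a string through a UTF-16 host string costs two transcodes, while copying raw UTF-8 bytes between linear memories is a single byte copy with no transcoding.

```rust
// Illustrative sketch (assumption: the host string is UTF-16, like JS).

fn via_utf16_host(s: &str) -> String {
    // Transcode 1: UTF-8 -> UTF-16 (simulates materializing a JS string).
    let utf16: Vec<u16> = s.encode_utf16().collect();
    // Transcode 2: UTF-16 -> UTF-8 (simulates reading it back into Rust).
    String::from_utf16(&utf16).expect("valid UTF-16")
}

fn via_direct_copy(s: &str) -> String {
    // One memcpy-like copy of the raw UTF-8 bytes, no transcoding.
    String::from_utf8(s.as_bytes().to_vec()).expect("valid UTF-8")
}
```

Both round-trips preserve the string; the difference is purely in the amount of work done per byte.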
That is handled by other proposals; this proposal is only for strings.
I may be missing something, but if you’re compiling and linking two Rust modules, why would you need JS involved at all? But assuming that you do, stringref won’t help here. The first module would create a stringref from UTF-8 in its linear memory, and then the second would encode that into its own linear memory. In SpiderMonkey, as noted above, both of these steps will involve a transcode between UTF-8 and WTF-16. That’s the same situation as with JS String and externref. Theoretically, SM could gain WTF-8 support for strings someday in the future. But in this case, we would still need to materialize the intermediate stringref and could not just copy from one linear memory to the other directly.
Again, not sure how to avoid the copy required to materialize the intermediate stringref value that will be consumed by Java. Unless you’re assuming some sort of whole program analysis?
Sure, I’m just trying to understand how UTF-8-focused languages are expected to benefit from this, as that seems to be one of the design goals.
There are plenty of reasons why you might want to dynamically link Wasm modules together, instead of statically linking Rust crates together.
See the old interface-types proposal. When linking Wasm modules together, the Wasm compiler has full knowledge of adapter functions, so it can inline and optimize them. That means the Wasm compiler can remove redundant copies and unnecessary transcoding. This does not require whole program analysis. Although adapter functions are not currently a part of any proposals, they are an example of the kind of optimizations that stringref makes possible.
I agree with @Pauan that linear-memory languages like Rust are highly unlikely to use stringref as their general-purpose string representation, because stringref provides garbage-collected strings. As discussed above, such languages may nevertheless have certain boundary-related use cases where stringref offers benefits.

So the examples to look for would be managed languages with UTF-8 strings. According to Wikipedia these may include e.g. Go, Julia, PyPy (and Swift, if someone decided to compile it to WasmGC instead of ARC'ed linear memory). Generally speaking, any such language would use stringview_wtf8 just like other languages use stringview_wtf16: whenever they need to perform an operation that wants to assume the string is encoded in WTF-8/UTF-8, they would first acquire a view. And even if a given toolchain hasn't learned that trick yet, it's (admittedly a bit of legwork but) not overly difficult for an optimizing compiler in an engine to hoist view creation out of loops. V8 already supports that for stringview_wtf16.

I think the fact that managed-memory UTF-8 based languages so far aren't strongly represented in the world of WasmGC matches (not by coincidence!) the fact that current engines don't have great UTF-8 optimizations. I expect what will happen is that over the years partnerships will emerge (I don't know which) where toolchains and engines work together to solve this chicken-and-egg problem and bring additional languages to Wasm. (As you're probably aware, some of these languages, such as Go, have additional requirements that the WasmGC MVP isn't providing, so there will be more need for collaboration anyway.) So I'm not worried about the fact that current engines don't yet have highly-optimized support for UTF-8 strings/string_views, and I don't think they need to be in a rush to build it.
I think it's a strength of the stringref proposal that it lays the spec-side groundwork for a future where there's more than WTF-16: when UTF-8 based languages become more popular as sources for Wasm modules, we won't have to change the spec; we'll only have to do the engine-side optimization work then.
What toolchain are you using to dynamically link random Wasm modules together? You need some ABI to decide how values from different languages are passed around. If it’s Rust modules, you’ll use the Rust toolchain, and that will not use stringref as part of its internal ABI. If it’s C++, it’ll be a similar situation. There is no toolchain that supports dynamic linking of Rust and C#, except possibly the component-model. And there, as noted before, stringref does not give you anything extra for linear-memory languages.
The key part of adapter functions that enables that optimization is matching a single lift with a single lower and fusing them, to omit the temporary value that would otherwise be required. Stringref is not required for that.
As discussed above, I don’t think there are boundary-related benefits above the current state of the art on the Web or off the Web.
Do you have any data on a managed language with UTF-8 strings using this proposal? My read from your comment is that it’s expected that for these languages there will be a copy+transcode every time they access their strings, with possibly some optimizations to common up this work in certain cases. I think the default expectation should be that this will have very poor performance, but I could be convinced otherwise if there was data to the contrary.

And I understand that in the future engines could add WTF-8 representations to speed this up and reduce memory usage. However, I would suggest removing these instructions from the proposal unless and until that optimization is generally available. It doesn’t make sense to have WTF-8 support in this proposal if languages will not use it until some future optimization. Codifying it in the spec now would make it harder to make changes if we deemed it necessary when adding WTF-8 support.
There are multiple ways to link Wasm modules together, such as Wasmer, or linking the Wasm modules in the browser using the JS APIs. Eventually esm-integration will make linking Wasm modules much easier. There is even an entire ecosystem of self-contained Wasm modules which are intended to be linked together. And there is a push for using Wasm in serverless computing (e.g. AWS Lambda and Cloudflare) and also in cryptocurrency. Those also benefit from dynamically linking Wasm modules. But that's getting very off-topic. Regardless of your opinion on it, some people do dynamically link Wasm modules together, and that is a use case that the WasmWG intends to support.
WASI is an ABI that can be used for Wasm module communication. And the component-model proposal (previously interface-types) is also intended to create an ABI for cross-module communication. Of course people can create their own ABIs as well (e.g. cryptocurrency projects create their own ABIs for Wasm modules).
The goal is to have many different lift / lower instructions, including a "lift / lower from UTF-8" instruction, which would copy UTF-8 bytes from linear memory into a stringref (and back). So if the Wasm compiler sees a "lift from UTF-8" instruction followed by a "lower to UTF-8" instruction, then it can fuse them together and avoid the transcoding and the intermediate stringref allocation.
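A minimal sketch of that fusion idea in Rust (the function names here are made up for illustration, they are not instructions from any proposal): a "lift" copies UTF-8 bytes out of a source linear memory into an intermediate string, a "lower" writes a string into a destination linear memory, and a lift feeding directly into a lower can be fused into a single byte copy.

```rust
// Hypothetical lift: intermediate heap allocation + UTF-8 validation.
fn lift_utf8(src_mem: &[u8], ptr: usize, len: usize) -> String {
    String::from_utf8(src_mem[ptr..ptr + len].to_vec()).expect("valid UTF-8")
}

// Hypothetical lower: write the string's bytes into the destination memory.
fn lower_utf8(s: &str, dst_mem: &mut [u8], ptr: usize) {
    dst_mem[ptr..ptr + s.len()].copy_from_slice(s.as_bytes());
}

// Fused form: no intermediate string, just one memcpy-like copy.
fn fused_copy(src_mem: &[u8], src_ptr: usize, dst_mem: &mut [u8], dst_ptr: usize, len: usize) {
    dst_mem[dst_ptr..dst_ptr + len].copy_from_slice(&src_mem[src_ptr..src_ptr + len]);
}
```

Both paths produce identical destination bytes; the fused path simply skips the allocation and validation in the middle.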
My point wasn't that there is no value in linking together code (or in dynamic linking). My point is that to link any code together they need to share the same ABI. You cannot link completely random wasm modules together. For linear memory languages there are three options I know of:
None of these benefit from stringref for linear memory languages.
stringref is not in those proposals. The component-model (and interface-types before it) uses its own string type for communicating between components. The component-model could be extended to use stringref in the future, but as noted above, this doesn't gain anything for linear-memory languages.
@eqrion It seems there's some sort of misunderstanding here, so I'll try to clarify as best as I can...

Many different languages compile to Wasm. Most of those languages want GC strings, so they do not want to put strings into linear memory. That includes languages like Java, C#, Python, Go, etc. Some languages, however, don't want GC strings; they do want to put strings into linear memory (Rust, C++, etc.).

In addition to that, languages have different string encodings and representations. C++ has NUL-terminated strings; Rust does not. Some languages use UTF-8, some use UTF-16, some use WTF-16, etc.

In addition to that, Wasm modules need to interop with the host. That host could be the browser / JS (which uses WTF-16 GC strings), or it could be something else entirely (Wasmtime, Wasmer, etc.).

One of the goals of Wasm is to allow different Wasm modules to interop with each other, regardless of their source language, and regardless of their internal representation. That means we need a string ABI which can accommodate as many languages as we can, and can also accommodate many different hosts, and can also accommodate both GC and non-GC strings, while still being efficient.

The current component-model string is an MVP. That means it is intentionally not designed to solve the problem of universal string interop. It's just the simplest thing that works right now. In particular, the component-model string is always a USVString, which does not work for interop with the host, and it also doesn't support GC strings either. Many Wasm proposals are like that: they do the minimum necessary to get the proposal working, but they leave room for future improvement.

However, in the long term the component-model string is not good enough. We need a string type which can work both for GC languages and non-GC languages, and it must also be able to fully interop with the host as well. That's where stringref comes in.
However, non-GC languages still benefit, because they are now able to seamlessly interop with any other Wasm module, including modules that use GC.

Let's say that I create a Rust Wasm module. I then publish that Rust Wasm module as a library, so other people can use it. My Rust Wasm module might be linked to any other Wasm module. Since it's a library, I do not know ahead of time which modules it will be linked to. If my Rust Wasm module uses stringref at its boundary, it can interop with all of them.

In the ideal case where the linked modules have the same string representation (e.g. UTF-8), the adapter functions will be optimized to remove the redundant transcoding. In the less ideal case where the linked modules don't have the same string representation, there is a performance cost, but at least it still works, because both modules are using stringref.

So we get universal string interop regardless of the source language, and regardless of the host, and the performance is optimized. This is something that the component-model string cannot do, and stringref can.
Where are you seeing this as a goal of WebAssembly? I don't see it on the listed high-level goals. This looks like more of a goal of the component-model, which is a separate layer from wasm.
Languages using Wasm-GC can already store strings in arrays of i8 or i16. From the above discussion, it sounds like linear-memory languages won't use stringref internally, and I'm also unsure whether they would use it as part of their ABI instead of externref.
My point above about ABI is that you cannot publish a Rust Wasm module to be linked with any arbitrary other Wasm module from possibly a different language. Strings are a small part of the Rust ABI; you would also need to define structs, arrays, tuples, enums, references, pointers, etc. You need every detail to line up for linking to work. The only proposal I know of that is tackling this issue is the component-model, and that should not be confused with the core Wasm instruction set.
I want to refocus this issue on UTF-8 users of this proposal, so going back to my earlier point: do we have any data on a managed language with UTF-8 strings using this proposal? My concern is that for languages running in engines without native UTF-8 string representations (all of them in the near to medium term), there will be very frequent copy+transcode operations, as they'll need to reacquire the non-native utf8-view for every operation and access. I understand we could sometimes common up acquiring the views using local function optimizations, but I don't think that will be enough to have good performance.
I'm not aware of any concrete data so far; the Scheme-to-Wasm compiler that @wingo is working on is probably closest to being able to generate such data, but I don't know what "closest" means in actual calendar terms.
It's not every operation. Specifically, the following operations don't need to acquire a utf8-view (regardless of whether each of these is an instruction or an import):
Whereas this is the list of operations that do need to acquire a utf8-view:
This is why I'm not worried: performance will be at least "okay" right out of the box, even on engines without UTF-8 optimizations.
This relies on having these operations implemented in the host (as an instruction or import), where they can use the native encoding format directly. For UTF-8-specific operations that aren't standardized across hosts, these will need to be emulated in Wasm code. In some cases they may be able to use the iterator interface, but not if their source code is written to use indices into the raw bytes. Using indices to access strings is pretty common in standard libraries [1], and I would guess it also happens in user code too.

It's theoretically possible this could have acceptable performance, but I think there are very good reasons to default to thinking that representing a source language's strings in a non-native encoding won't work. It just takes a loop, a large string, and an indexing operation on it to get a ton of copies+transcodes. And if I were a source language compiling to Wasm, I would want control over this to prevent it from happening, and would avoid using stringref if this was a possibility.

[1] https://cs.opensource.google/go/go/+/refs/tags/go1.20.5:src/strings/strings.go;l=1049
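The access pattern at issue can be sketched in Rust (an illustrative stand-in for a byte-indexing loop like the Go standard library's): the byte view is acquired once before the loop, whereas if every indexed access had to re-acquire a UTF-8 view via a copy + transcode, the same linear scan would become quadratic in the string length.

```rust
// Counts ASCII spaces by byte index, the style of scan common in
// standard libraries. Indexing (rather than iterators) is deliberate:
// it's the pattern that forces a byte view to exist.
fn count_spaces(s: &str) -> usize {
    let bytes = s.as_bytes(); // acquire the byte "view" once, before the loop
    let mut n = 0;
    for i in 0..bytes.len() {
        if bytes[i] == b' ' {
            n += 1;
        }
    }
    n
}
```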
I strongly believe that you generally want all performance-critical "bulk" operations (processing an entire string at once) to be implemented in the host, because that makes them much faster. This is very similar to the memory bulk operations.
Of course you wouldn't want "a ton of copies+transcodes" for a single loop. That's why this proposal offers view creation as an explicit step, so you do have control to do that just once, before entering the loop. (And in the fullness of time, engines will turn even that into a no-op.) The proposal also offers the ultimate escape hatch of converting strings to arrays (and back), for arbitrary manipulation.
The specific Go loop you linked to would be well expressible in a direct translation, in particular all the indexing it does.
Memory bulk operations have a nice complexity/benefit payoff in my opinion. They're simple loops of loads/stores and are extremely common and hot in programs. String operations (assuming things like toUpper/trim/split) are much harder to specify as there are an order of magnitude more of them across different languages (with incompatible variants of the same concept). And it's unclear to me how much better a host string 'trim' method could be over a wasm string 'trim' method to justify the complexity.
I think the problem I'm getting at is that the Go string type would need to be a stringref. If your source language compiler can reliably hoist the view creation out of loops, that would help, but I'm not confident that can be relied upon in general.
Go may not be a good example here; their string type is just bytes, with the encoding given to it by each operation that accesses it. So it's not clear to me that they would use stringref.
If the X-to-Wasm compiler messes it up and produces a suboptimal module, then there's still a good chance that a sufficiently smart engine will save the day, by doing the hoisting engine-side.

This isn't specific to UTF-8, or stringref, or even Wasm! Even in the existing case of JS strings in a JS engine, you don't want to check "is this string a rope that needs to be flattened?" in every iteration of a string-indexing JS loop. You want to perform that check only once, before the loop. JS doesn't have a way to express that in the language, so engines have no other option but to do it automatically under the hood. If an engine can do that for JS strings, then the same technique can make string view creation acceptably efficient in cases where the engine doesn't have a native string representation matching the requested view.

So compared to the status quo in JS, the stringref proposal's concept of views improves the situation (in typical Wasm fashion) by (1) giving module producers more control and (2) making engines' lives easier.
I'm trying to figure out how languages that primarily use UTF-8 for their strings would use this proposal.
The first example that comes to mind is Rust; however, a Rust String (which exists in linear memory) can be coerced to &str, and so neither type can be transparently a stringref. So you'd need to either: (1) rewrite code to use a WasmString type or (2) copy on the boundary into linear memory. (2) isn't really different from what we have today, from what I can tell.

Thinking about (1), I'm skeptical that code is going to be rewritten to use it, but assuming that it is, I'm not sure how it would utilize this proposal. My best guess is they'd:
- use stringref so that you can use string.eq/concat
- acquire a stringview_wtf8 whenever an accessor is called (like indexing)

The concern I have with this is that SpiderMonkey wouldn't be able to store WTF-8 contents inside our stringref for the medium-term future. So every single accessor call, like indexing, would force a transcode from the stringref to the view.
Maybe you could make WasmString cache the wtf8_view lazily so it can re-use the view from a previous accessor call? But then strings would have twice the memory overhead.
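That lazy-caching idea can be sketched in Rust (all names here are hypothetical; EngineString stands in for a stringref whose engine representation is WTF-16): the UTF-8 view is transcoded once on first access and reused afterwards, at the cost of keeping both representations alive.

```rust
use std::cell::OnceCell;

struct EngineString {
    utf16: Vec<u16>,              // engine's native representation
    utf8_view: OnceCell<Vec<u8>>, // lazily created view (doubles memory)
}

impl EngineString {
    fn new(s: &str) -> Self {
        EngineString {
            utf16: s.encode_utf16().collect(),
            utf8_view: OnceCell::new(),
        }
    }

    fn byte_at(&self, i: usize) -> u8 {
        // First call pays the transcode; later calls reuse the cache.
        let view = self.utf8_view.get_or_init(|| {
            String::from_utf16(&self.utf16).expect("valid UTF-16").into_bytes()
        });
        view[i]
    }
}
```

The memory trade-off the comment raises is visible in the struct: once the view is populated, both the UTF-16 buffer and the UTF-8 bytes are held simultaneously.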
Am I missing something? I also would be interested in other languages, but my mind is coming up blank.