Augment load/store instructions w/ index and scale #1277
I was recently looking at dumps of some array-heavy code (output of LLVM -O3 and Binaryen) and it was dominated by the above code pattern, doing the same shift/add again, and again, and again. Besides code size benefits, there are benefits for baseline-compiling engines (detecting the above pattern and then using the shift built into an x86 or ARM addressing mode is not trivial, whereas doing that for the proposed load is). Then there are minor benefits to producers, and to people reading wasm :) Having the stack signature change based on an immediate is not ideal, but then again I think adding a parallel set of load instructions is probably worse. Another common pattern I see a lot in code that is worth thinking about:
After some more discussion: rather than encoding the scale as a shift (which can only represent powers of 2), you can encode a value using a format similar to IEEE 754 floats, with a 3-bit exponent `eee` and a 2-bit significand `ss`. We can borrow from the IEEE 754 spec and treat `eee == 0` as a subnormal, i.e. with no implicit leading bit. You can calculate these values using the C expression: `(eee == 0 ? ss : ((4 | ss) << (eee - 1)))`. This creates the following scale values: 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384, 448.
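A small C program makes the encoding concrete (a sketch; the field layout, with the exponent in the high bits, is an assumption):

```c
#include <stdint.h>
#include <stdio.h>

/* Decode a 5-bit scale immediate: 3-bit exponent `eee` (assumed to be
   the high bits) and 2-bit significand `ss` (the low bits). */
static uint32_t decode_scale(uint32_t imm5) {
    uint32_t eee = (imm5 >> 2) & 0x7;
    uint32_t ss  = imm5 & 0x3;
    /* eee == 0 is the subnormal case: no implicit leading bit. */
    return eee == 0 ? ss : ((4 | ss) << (eee - 1));
}

int main(void) {
    /* Prints 0 1 2 3 followed by 4..448: the list above, plus the
       subnormal 0 (presumably meaning "no index operand"). */
    for (uint32_t imm = 0; imm < 32; imm++)
        printf("%u ", decode_scale(imm));
    printf("\n");
    return 0;
}
```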
Glad to have someone look into this! One important thing to consider is the performance on a 64-bit implementation using signal handling to perform bounds checking: currently, one can reserve (4gb + K + ε) guard pages so that a wasm load or store needs no explicit bounds check; every `base + offset` effective address falls inside the reservation, and anything out of bounds faults in the guard region. A scaled index can push the effective address far beyond any reservation of that size. So this means we need to consider the performance on trapping systems that don't have enough vmem to handle it this way.
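For reference, a minimal sketch of the guard-page scheme described here (constants and names are illustrative, not from any particular engine):

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define WASM_MAX_MEMORY  (4ULL << 30)   /* covers every 32-bit index */
#define OFFSET_SLACK     (1ULL << 31)   /* the "K + epsilon" headroom */

/* Reserve the whole range inaccessible, then commit only the pages
   backing the current memory size. A wasm access at base + offset then
   either hits committed memory or faults in the guard region, so no
   explicit bounds check needs to be emitted. */
static uint8_t *reserve_wasm_memory(size_t committed_bytes) {
    size_t reservation = (size_t)(WASM_MAX_MEMORY + OFFSET_SLACK);
    uint8_t *base = mmap(NULL, reservation, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, committed_bytes, PROT_READ | PROT_WRITE) != 0)
        return NULL;
    return base;
}
```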
I haven't studied the problem enough to know what the net effect of non-wrapping semantics would be (well-predicted branches are cheap; hoisting is still somewhat possible; etc.), but it's at least conceivable that the scaled form could end up no faster than the explicit shift/add sequence on such systems.
Yeah, that's a good point. I thought there was a reason we didn't do this way back when :-) We could consider specifying wrapping behavior for this, though it would be pretty weird to have different behavior from regular memory accesses. OTOH, since OOB access is defined to trap, we could decide to loosen behavior to say that all memory accesses have wrapping behavior. That's a much more invasive change, of course.
Wrapping or non-determinism definitely sound worse, and also may be something we'd regret in the future: my hope is that one day in the far future we'll get to use something more like x86 segmentation (where there are [base, limit] registers that are checked after effective address computation), which would avoid the need for vmem reservations altogether (and also give wasm full use of (base + index*scale + disp) addressing).
Agreed, though it's also a bit unsatisfying to limit the functionality of wasm based on a specific optimization that some implementations might never use. One alternative that came up w.r.t. wasm64 is using a separate mprotect'ed region to perform the "bounds check". Assuming a 4k page size, you could reduce virtual memory requirements by a factor of 16. We could also add large wasm pages, and require their use for this feature, to potentially save more here. Not sure if that's worth the extra load for every memory access (and additional register pressure), though.
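One reading of how such a check region could work (a sketch; all names and constants are assumptions):

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Instead of reserving 4 GiB+ for the memory itself, reserve a 256 MiB
   side region with one byte per 16 bytes of wasm memory (4 GiB / 16,
   the factor-of-16 saving mentioned above). With 4 KiB host pages,
   each side-region page covers 4 KiB * 16 = 64 KiB of wasm memory,
   exactly one wasm page, so protection granularity lines up with
   memory.grow. */
#define CHECK_REGION_SIZE ((size_t)((4ULL << 30) / 16))

static uint8_t *check_base;  /* PROT_NONE except for in-bounds pages */

static void init_check_region(size_t wasm_memory_bytes) {
    check_base = mmap(NULL, CHECK_REGION_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    /* Unlock one side-region page per live 64 KiB wasm page. */
    mprotect(check_base, wasm_memory_bytes / 16, PROT_READ);
}

/* Compiled code would emit this extra load before each memory access;
   it faults iff addr lies beyond the current memory size. */
static inline void bounds_check(uint32_t addr) {
    volatile uint8_t probe = check_base[addr >> 4];
    (void)probe;
}
```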
Something some Forth implementations do is strength reduction in peephole optimisation. They slide a window a few instructions wide (a peephole) over the incoming instruction stream and replace sequences like "LIT 1 ADD" with "INC" (not a normal Forth word). Generally it's good not to discard information during compilation, but for these kinds of patterns there are some advantages in leaving it to a peephole optimisation during load time.

Not saying this is right; just some points for keeping it more RISC-like.
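As a concrete illustration, a toy peephole pass in C (opcode names invented for the example; a wasm engine would fuse shift/add/load the same way):

```c
#include <stddef.h>

enum op { PUSH_CONST, ADD, INC /* , ... */ };

struct insn { enum op op; int imm; };

/* Two-instruction window over a decoded stream: rewrite
   "PUSH_CONST 1; ADD" into the fused "INC", in place. */
static size_t peephole(struct insn *code, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && code[out - 1].op == PUSH_CONST &&
            code[out - 1].imm == 1 && code[i].op == ADD) {
            code[out - 1].op = INC;   /* fuse: overwrite the PUSH_CONST */
            code[out - 1].imm = 0;
        } else {
            code[out++] = code[i];
        }
    }
    return out;  /* new (possibly shorter) instruction count */
}
```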
Streaming compilers can generate good code for this by using thunks, which are a more general-purpose method for emitting fused instructions (for example, folding a shift and add into a single x86 addressing mode). I support this since it means that pretty much any memory access requires fewer instructions, but any baseline compiler that's serious about performance can handle this anyway; it doesn't strictly increase the expressiveness or optimisability of Wasm, it just makes it easier to emit better code for this specific case.
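One way to read the thunk idea (a sketch; all names are illustrative and not from any particular engine):

```c
/* A baseline compiler's abstract stack holds descriptions of values
   rather than forcing every value into a register immediately, so a
   consumer can fuse several wasm ops into one machine instruction. */
enum kind { IN_REG, CONST_I32, REG_SHIFTED };

struct thunk {
    enum kind kind;
    int reg;    /* IN_REG, REG_SHIFTED: which register */
    int imm;    /* CONST_I32: the constant; REG_SHIFTED: shift amount */
};

/* i32.shl by a constant: emit nothing, just remember the shift. */
static struct thunk compile_shl(struct thunk value, struct thunk amount) {
    if (value.kind == IN_REG && amount.kind == CONST_I32)
        return (struct thunk){ REG_SHIFTED, value.reg, amount.imm };
    /* otherwise: materialize both operands and emit a real shift
       (omitted in this sketch) */
    return value;
}

/* A later i32.add + i32.load can then consume a REG_SHIFTED thunk by
   emitting a single mov with a [base + index*scale] operand (legal on
   x86 when the shift is 0..3). */
```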
Instead of overloading the load instruction with ad-hoc semantics we could go half-way and introduce an address calculation instruction that performs the necessary operation:
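One plausible shape for such an instruction (the name `i32.addshift` and its exact signature are assumptions), expressed via its C semantics:

```c
#include <stdint.h>

/* Hypothetical `i32.addshift s`: [base i32, index i32] -> [i32], with
   the shift amount s as an immediate. Like i32.add, the result wraps
   modulo 2^32, which fits the wrapping comment below. */
static inline uint32_t i32_addshift(uint32_t base, uint32_t index,
                                    unsigned s /* 0..31, immediate */) {
    return base + (index << s);
}
```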
If you're doing that, why not just have an FMA instruction that takes 3 params from the stack? It doesn't help performance on even the simplest of compilers to enforce that the scale is static.
"The simplest of compilers" will not even try to track operand values and may be helped by having the scale in the instruction instead of having to discover that the value is constant. And then there's the option of folding more operations that are typical to address computations into the instruction, if we think compilers can make use of them (open question), so FMA may not be what you want in that case. Anyway the original motivation seems to be compactness of representation more than anything. It would be useful to see some plausible data on how much we'd actually save on the various options for encoding. |
I like this idea; the wrapping behavior is natural and expected in this case.
Yes, Wouter and I originally discussed it because he was seeing this operation a lot in his generated code. But it would be interesting to see how often it occurs in larger modules too. I started looking at a few examples the slow way (grepping and eyeballing the surrounding instructions), and found that it's common, but not overly so. Often it seems the scale is lowered to an add for each iteration of a loop, as you'd expect.
I wrote a tool to read through all instructions in a wasm binary and look for common instruction sequences. It also changes all indexes and memory offsets to 0. I took some wasm files that I had locally and looked for an instruction sequence like the shift/add/load pattern above. Even so, you can see that the equivalent of that pattern is common.

tanks (5,119,121 total instructions):
banana bread (924,654 total instructions)
ammo (bullet physics) (312,299 total instructions)
doom3 (1,953,580 total instructions)
pspdfkit (3,417,737 total instructions)
webp encoder (73,089 total instructions)
mozjpeg encoder (77,684 total instructions)
optipng encoder (119,159 total instructions)
UE4 demo (15,994,748 total instructions)
Pyodide (3,812,914 total instructions)
That is great data. It certainly speaks for an instruction along these lines.
Chatting w/ @aardappel offline, we realized that a lot of these modules are older, and are optimized differently. Here's some data from newer modules. It seems to have similar distributions, though perhaps the sequences are less common (this is likely because newer modules seem to be moving the constant base closer to the add; see x86 emulator for example). edit: actually, it seems that Godot is pretty old too (2017).

Godot engine (6,641,590 total instructions)
x86 emulator (163,474 total instructions)
makepad (345,687 total instructions)
sandspiel (33,841 total instructions)
Are these all generated from the same original language & toolchain? We may find that this changes when more languages and toolchains generate wasm.
Yes, primarily fastcomp (asm.js) + asm2wasm, or LLVM + Binaryen. Agreed it would be good to look at other toolchains; do you have some good examples I should try?
@KronicDeth here's the tool I was using to generate this data: https://github.com/binji/wasp#wasp-pattern-examples |
@binji I don't have any myself. Sorry. Just a general wish that there should be more. mumbles something about FP and other paradigms using WASM sometime. |
@aardappel and I were discussing this briefly today:
If you were to compile the following array access:
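A representative example of such an access (a sketch; the exact original expression and names are assumptions):

```c
#include <stdint.h>
#include <string.h>

/* x is the byte address of an i32 array in linear memory, i an element
   index: the load needs the byte address x + i*4. */
static int32_t get(const uint8_t *mem, uint32_t x, uint32_t i) {
    int32_t value;
    memcpy(&value, mem + x + (i << 2), sizeof value);
    return value;
}
```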
in current wasm, you have to generate something like this:
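A sketch of the lowering (exact operands are assumptions, but this is the pattern the thread is about):

```c
#include <stdint.h>

/* Current wasm must compute the address with explicit instructions:
     local.get $x   ;; base (byte address)
     local.get $i   ;; element index
     i32.const 2
     i32.shl        ;; i * 4
     i32.add        ;; x + i*4
     i32.load       ;; 6 instructions total
   The same effective-address computation in C: */
static uint32_t effective_address(uint32_t x, uint32_t i) {
    return x + (i << 2);   /* wrapping i32 arithmetic */
}
```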
Since this is such a common operation, it would be nice to have a dedicated instruction for this. We could encode an immediate scale value by using unused alignment bits, since the current spec requires that the alignment value is not larger than natural alignment for the access width.
So for `i32.load`, we could encode the new alignment immediate so that `aa` is the alignment value and `sssss` provides a constant shift value in the range [0, 32). This scale is not very useful without a base value, so providing a non-zero `sssss` value would change the stack signature of the load from `[base i32] -> [i32]` to `[base i32, index i32] -> [i32]`.

(We may instead want the scale to be `sssss - 1`, so we can perform `base + index*1`, and because `sssss == 31` is not very valuable.)

With this new instruction, we can generate a nicer instruction sequence:
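Presumably the new sequence looks something like this (a sketch; the mnemonic for the scale immediate is an assumption):

```c
#include <stdint.h>

/* With the scale immediate, the shift and add fold into the load:
     local.get $x          ;; base
     local.get $i          ;; index
     i32.load (scale=2)    ;; address = x + (i << 2), computed internally
   Dropping `i32.const 2` (2 bytes), `i32.shl` (1 byte), and `i32.add`
   (1 byte) accounts for the 4 bytes saved. In C terms, the load's
   internal address computation: */
static uint32_t scaled_address(uint32_t x, uint32_t i) {
    return x + (i << 2);
}
```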
saving 4 bytes. It has some other nice benefits too: it should be easier for baseline compilers to optimize this instruction. We can also require the address calculation to be non-wrapping, which the previous instruction sequence can't guarantee (and would be difficult to provide).
Thoughts?