Too many raw_bitcasts in SIMD code #1147
Comments
Thanks for opening an issue. It seems a bit weird that the translator adds the raw_bitcasts v6 and v7; this might be a bug in the translator, or something else. Does GVN not reduce the number of redundant raw_bitcasts? One way to look at this would be to see the function's CLIF after it has been optimized. Should we introduce a […]? For simple return values, we could probably just not introduce a raw_bitcast when the return type is a single type. (There could be different return (sub)types because of control flow, in which case they would still need to be unified to a single type using raw_bitcasts.)
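A toy sketch of why GVN can remove *redundant* casts but not required ones: identical `raw_bitcast`s of the same value collapse to a single value number. The `gvn` function and its representation of instructions as `(opcode, operand)` pairs are invented for illustration and are not Cranelift's actual GVN pass:

```rust
use std::collections::HashMap;

// Toy value numbering over (opcode, operand) pairs: two instructions that
// are structurally identical get the same value number, so a duplicate
// raw_bitcast of the same value would be folded away by a GVN pass.
fn gvn(ops: &[(&str, u32)]) -> Vec<u32> {
    let mut table: HashMap<(&str, u32), u32> = HashMap::new();
    let mut next = 0u32;
    ops.iter()
        .map(|&key| {
            // First sighting of this (opcode, operand) pair allocates a
            // fresh number; repeats reuse the existing one.
            *table.entry(key).or_insert_with(|| {
                next += 1;
                next - 1
            })
        })
        .collect()
}
```

Two bitcasts of the same value share a number, while casts of distinct values do not, which is why GVN alone cannot eliminate the casts the translator genuinely needs.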
I had to add code to do this because when I'm translating a single Wasm operation, I don't know if the next operation will be a […]
I don't think so; that snippet is (IIRC) from the […]
We could; I guess I was hoping for something more.
Not sure I completely understand. I'll ping you on IRC.
Yeah, this sounds a bit better than converting instructions' outputs. It would be more in line with what other instructions do: convert their input in an on-demand, lazy fashion. This means that for a function returning one or several […]. I don't know how big of an effect this will have on reducing the number of raw_bitcasts, though, so we might need to do more. Regarding options brought up in the original post:
Here's my experiment: I converted all of the SIMD spec tests from https://github.com/WebAssembly/testsuite to WASM files using […]
I then added a bunch more bitcasts (e.g. the ones for restoring the type to what the function expected) and re-compiled the 95 tests:
I would ignore the times and cycles for now and focus on the ~30M more instructions executed due to adding the bitcasts. That's about 1.3% more instructions (30M / 2347M). After adding the bitcasts, I checked again how many of the 220 WASM files would compile, and now 115 do (previously 95). I'm still working on a run without the verifier enabled that compares all bitcasts vs. no bitcasts at all, but am running into compile issues.
Here's that comparison: I removed all bitcasting, which means the types are likely incorrect, but this should give a better comparison of the effect of something like point 3 above. When I do this, 98 WASM files compile. I ran these with no bitcasting (bytecodealliance/cranelift@afd1761):
And then with all bitcasts (bytecodealliance/cranelift@41f910f):
I see around 1% (12M / 1187M) additional instructions when bitcasting. What is deficient in this comparison is that, as I look at the 98 WASM files that compiled, not many of them use instructions that would require bitcasts. In fact, 71 of these are from […]
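The overhead arithmetic from the two runs above can be double-checked with a quick calculation (the instruction counts are the approximate figures quoted in these comments):

```python
def overhead_pct(extra_millions: float, total_millions: float) -> float:
    """Extra retired instructions as a percentage of the baseline total."""
    return 100.0 * extra_millions / total_millions

# Spec-test run with the verifier: ~30M extra out of ~2347M baseline.
print(round(overhead_pct(30, 2347), 1))  # ~1.3
# All-bitcasts vs. no-bitcasts run: ~12M extra out of ~1187M baseline.
print(round(overhead_pct(12, 1187), 1))  # ~1.0
```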
You mentioned trying to bitcast before a […]
I had been thinking about that and actually tried it today, with no luck. In […]
* Add x86 encodings for `bint` converting to `i8` and `i16`
* Introduce tests for many multi-value returns
* Support arbitrary numbers of return values

  This commit implements support for returning an arbitrary number of return values from a function. During legalization we transform multi-value signatures to take a struct return ("sret") return pointer, instead of returning its values in registers. Callers allocate the sret space in their stack frame and pass a pointer to it into the callee, and once the callee returns to them, they load the return values back out of the sret stack slot. The callee's return operations are legalized to store the return values through the given sret pointer.

* Keep track of old, pre-legalized signatures

  When legalizing a call or return for its new legalized signature, we may need to look at the old signature in order to figure out how to legalize the call or return.

* Add test for multi-value returns and `call_indirect`
* Encode bool -> int x86 instructions in a loop
* Rename `Signature::uses_sret` to `Signature::uses_struct_return_param`
* Rename `p` to `param`
* Add a clarifying comment in `num_registers_required`
* Rename `num_registers_required` to `num_return_registers_required`
* Re-add newline
* Handle already-assigned parameters in `num_return_registers_required`
* Document what some debug assertions are checking for
* Make "illegalizing" closure's control flow simpler
* Add unit tests and comments for our rounding-up-to-the-next-multiple-of-a-power-of-2 function
* Use `append_isnt_arg` instead of doing the same thing manually
* Fix grammar in comment
* Add `Signature::uses_special_{param,return}` helper functions
* Inline the definition of `legalize_type_for_sret_load` for readability
* Move sret legalization debug assertions out into their own function
* Add `round_up_to_multiple_of_type_align` helper for readability
* Add a debug assertion that we aren't removing the wrong return value
* Rename `RetPtr` stack slots to `StructReturnSlot`
* Make `legalize_type_for_sret_store` more symmetrical to `legalized_type_for_sret`
* rustfmt
* Remove unnecessary loop labels
* Do not pre-assign offsets to struct return stack slots

  Instead, let the existing frame layout algorithm decide where they should go.

* Expand "sret" into explicit "struct return" in doc comment
* typo: "than" -> "then" in comment
* Fold test's debug message into the assertion itself
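The struct-return legalization described in that commit can be modeled in ordinary Rust (an illustrative sketch only; the names `callee_sret` and `caller` are invented, and nothing here is Cranelift code):

```rust
// Legalized callee: instead of returning two values in registers, the
// "return values" become stores through a caller-provided sret pointer.
fn callee_sret(sret: &mut [i64; 2], a: i64, b: i64) {
    sret[0] = a + b;
    sret[1] = a - b;
}

fn caller() -> (i64, i64) {
    // The caller allocates the sret slot in its own stack frame...
    let mut slot = [0i64; 2];
    // ...passes a pointer to it into the callee...
    callee_sret(&mut slot, 10, 4);
    // ...and loads the return values back out after the call.
    (slot[0], slot[1])
}
```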
Because Wasm SIMD vectors store their type as `v128`, there is a mismatch between the more specific types Cranelift uses and Wasm SIMD. Because of this mismatch, Wasm SIMD translates to the default Cranelift type `I8X16`, causing issues when more specific type information is available (e.g. `I32X4`). To fix this, all incoming values to SIMD instructions are checked during translation (not at runtime) and, if necessary, cast from `I8X16` to the appropriate type by functions like `optionally_bitcast_vector`, `pop1_with_bitcast`, and `pop2_with_bitcast`. However, there are times when we must also cast outgoing values to `I8X16`, as with `local.set` and `local.tee`. There are other ways of resolving this (e.g. adding a new vector type, bytecodealliance/cranelift#1251) but we discussed staying with this casting approach in bytecodealliance#1147.
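A minimal model of that strategy: a cast is inserted only when an incoming value's type differs from the one the instruction needs. The `VType` enum and `casts_needed` helper are assumptions for illustration, not the real `cranelift-wasm` API:

```rust
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum VType {
    I8X16, // the default type v128 values are handed around as
    I32X4,
    F32X4,
}

// Count how many raw_bitcasts a consumer wanting `wanted` would insert
// for a list of incoming value types: only mismatched types need a cast,
// mirroring the on-demand behavior of `optionally_bitcast_vector`.
fn casts_needed(incoming: &[VType], wanted: VType) -> usize {
    incoming.iter().filter(|&&t| t != wanted).count()
}
```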
Previously, the logic was wrong on two counts:
- It used the bits of the entire vector (e.g. i32x4 -> 128) instead of just the lane bits (e.g. i32x4 -> 32).
- It used the type of the first operand before it was bitcast to its correct type.

Remember that, by default, vectors are handed around as i8x16 and we must bitcast them to their correct type for Cranelift's verifier; see bytecodealliance#1147 for discussion on this. This fix simply uses the type of the instruction itself, which is equivalent and hopefully less fragile to any changes.
* Ensure GlobalSet on vectors is cast to Cranelift's I8X16 type

  This is a fix related to the decision to use Cranelift's I8X16 type to represent Wasm's V128; it requires casting to maintain type correctness. See #1147.

* Enable SIMD spec test: simd_lane.wast
It seems to me that the simplest fix is simply to remove all […]
(notes copied from #2303): Disadvantages of using bitcasts: […]
Replacing […]
One longer-term way around that, once the old backend is no longer needed, would be to abandon the DSL for defining CLIF instructions and instead define them using a simple Rust enum, in the same way that the new backends define target-specific instructions. For the vector instructions, include a field of type […]
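That proposal could look something like the following (a hypothetical sketch; the `Inst` and `LaneType` names are invented, and real CLIF instructions carry far more information):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum LaneType {
    I8, I16, I32, I64, F32, F64,
}

// A plain Rust enum of instructions, in the style the new backends use
// for machine instructions. Vector instructions carry an explicit lane
// type, so the interpretation of the 128-bit operands is part of the
// instruction itself and no raw_bitcast is needed to recover it.
#[allow(dead_code)]
enum Inst {
    // Scalar add: the type comes from its operands.
    Iadd { a: u32, b: u32 },
    // Vector add: the lane type is an explicit field.
    VecIadd { lanes: LaneType, a: u32, b: u32 },
}
```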
Currently, many instructions are both scalar and vector instructions at the same time. For example, […]
What is the feature or code improvement you would like to do in Cranelift?
During translation from Wasm to CLIF, a combination of Wasm's `v128` type and Cranelift's current type system forces us to add many `raw_bitcast` instructions between operations. For example, this Wasm code: […] translates to this CLIF code: […]

This issue is to discuss if and how to remove these extra bitcasts.
What is the value of adding this in Cranelift?

The extra `raw_bitcast`s emit no machine code, but they are confusing when troubleshooting and add extra memory and processing overhead during compilation.

Do you have an implementation plan, and/or ideas for data structures or algorithms to use?
Some options:

1. Add types to `load` and `const`: WebAssembly/simd#125 (concerns about integer vs. floating-point instructions on x86) was discussed in the Wasm SIMD sync meeting (agenda: WebAssembly/simd#121), and someone brought up that making `load` and `const` typed (e.g. `f32x4.load`) would allow compilers to attach the correct types to values and retain them through the less strongly typed `v128` operations (e.g. `xor`). WebAssembly/simd#125 discusses this from a performance point of view, but that addition would also solve this issue.

2. Examine the DFG: another approach would be to look at the DFG to figure out the types of predecessors, as mentioned in WebAssembly/simd#1 (comment). This, however, would have to be extended for type signatures: Cranelift would have to look at the instructions in a function to figure out how the `v128` parameters are used. In the function `add-sub` above, with signature `(param v128 v128 v128)`, the addition and subtraction make this clear, but some functions will make this analysis impossible.

3. Add a `V128` type to Cranelift: Cranelift's type system could be extended to include a `V128` type that would include all the `INxN`, `FNxN`, and `BNxN` types. The instruction types would stay the same (e.g. `iadd` should still only accept integers), but type-checking could be relaxed to allow the `V128` type to be used as one of its valid subtypes. This opens up a mechanism to get around the type-checking, but arguably that already exists with `raw_bitcast`. Code that knows its types would remain as-is, but Wasm-to-CLIF translated code could use `V128` a bit more naturally than the `raw_bitcast`s.

4. Do nothing: I brought this up a long time ago when talking to @sunfishcode and that seemed the best thing to do then; I'm opening this issue to discuss whether that is still the case.
Have you considered alternative implementations? If so, how are they better or worse than your proposal?
See above.