Too many raw_bitcasts in SIMD code #1147
Comments
Thanks for opening an issue. It seems a bit weird that the translator adds the raw_bitcasts v6 and v7; this might be a bug in the translator, or something else. Does GVN not reduce the number of redundant raw_bitcasts? One way to look at this would be to see the function's CLIF after it has been optimized. Should we introduce a […]? For simple return values, we could probably just not introduce a raw_bitcast when the return type is a single type. (There could be different return (sub)types because of control flow, in which case they would still need to be unified to a single type using raw_bitcasts.)
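A toy sketch of why GVN can remove *redundant* casts but not required ones: identical `raw_bitcast`s of the same value collapse to a single value number. The `gvn` function and its representation of instructions as `(opcode, operand)` pairs are invented for illustration and are not Cranelift's actual GVN pass:

```rust
use std::collections::HashMap;

// Toy value numbering over (opcode, operand) pairs: two instructions that
// are structurally identical get the same value number, so a duplicate
// raw_bitcast of the same value would be folded away by a GVN pass.
fn gvn(ops: &[(&str, u32)]) -> Vec<u32> {
    let mut table: HashMap<(&str, u32), u32> = HashMap::new();
    let mut next = 0u32;
    ops.iter()
        .map(|&key| {
            // First sighting of this (opcode, operand) pair allocates a
            // fresh number; repeats reuse the existing one.
            *table.entry(key).or_insert_with(|| {
                next += 1;
                next - 1
            })
        })
        .collect()
}
```

Two bitcasts of the same value share a number, while casts of distinct values do not, which is why GVN alone cannot eliminate the casts the translator genuinely needs.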
I had to add code to do this because when I'm translating a single Wasm operation, I don't know if the next operation will be a […]
I don't think so; that snippet is (IIRC) from the […]
We could; I guess I was hoping for something more.
Not sure I completely understand. I'll ping you on IRC.
Yeah, this sounds a bit better than converting instructions' outputs. It would be more in line with what other instructions do: convert their input in an on-demand, lazy fashion. This means that for a function returning one or several […]. I don't know how big of an effect this will have on reducing the number of raw_bitcasts, though, so we might need to do more. Regarding options brought up in the original post:
Here's my experiment: I converted all of the SIMD spec tests from https://github.com/WebAssembly/testsuite to WASM files using […]
I then added a bunch more bitcasts (e.g. the ones for restoring the type to what the function expected) and re-compiled the 95 tests:
I would ignore the times and cycles for now and focus on the ~30M more instructions executed due to adding the bitcasts. That's about 1.3% more instructions (30M / 2347M). After adding the bitcasts, I checked again how many of the 220 WASM files would compile, and now 115 do (previously 95). I'm still working on a run without the verifier enabled that compares all bitcasts vs. no bitcasts at all, but am running into compile issues.
Here's that comparison: I removed all bitcasting, which means the types are likely incorrect, but this should give a better comparison of the effect of something like point 3 above. When I do this, 98 WASM files compile. I ran these with no bitcasting (bytecodealliance/cranelift@afd1761):
And then with all bitcasts (bytecodealliance/cranelift@41f910f):
I see around 1% (12M / 1187M) additional instructions when bitcasting. What is deficient in this comparison is that, as I look at the 98 WASM files that compiled, not many of them use instructions that would require bitcasts. In fact, 71 of these are from […]
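The overhead arithmetic from the two runs above can be double-checked with a quick calculation (the instruction counts are the approximate figures quoted in these comments):

```python
def overhead_pct(extra_millions: float, total_millions: float) -> float:
    """Extra retired instructions as a percentage of the baseline total."""
    return 100.0 * extra_millions / total_millions

# Spec-test run with the verifier: ~30M extra out of ~2347M baseline.
print(round(overhead_pct(30, 2347), 1))  # ~1.3
# All-bitcasts vs. no-bitcasts run: ~12M extra out of ~1187M baseline.
print(round(overhead_pct(12, 1187), 1))  # ~1.0
```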
You mentioned trying to bitcast before a […]
I had been thinking about that and actually tried it today, with no luck. In […]
* Add x86 encodings for `bint` converting to `i8` and `i16`
* Introduce tests for many multi-value returns
* Support arbitrary numbers of return values

  This commit implements support for returning an arbitrary number of return values from a function. During legalization we transform multi-value signatures to take a struct return ("sret") return pointer, instead of returning its values in registers. Callers allocate the sret space in their stack frame and pass a pointer to it into the callee, and once the callee returns to them, they load the return values back out of the sret stack slot. The callee's return operations are legalized to store the return values through the given sret pointer.

* Keep track of old, pre-legalized signatures

  When legalizing a call or return for its new legalized signature, we may need to look at the old signature in order to figure out how to legalize the call or return.

* Add test for multi-value returns and `call_indirect`
* Encode bool -> int x86 instructions in a loop
* Rename `Signature::uses_sret` to `Signature::uses_struct_return_param`
* Rename `p` to `param`
* Add a clarifying comment in `num_registers_required`
* Rename `num_registers_required` to `num_return_registers_required`
* Re-add newline
* Handle already-assigned parameters in `num_return_registers_required`
* Document what some debug assertions are checking for
* Make "illegalizing" closure's control flow simpler
* Add unit tests and comments for our rounding-up-to-the-next-multiple-of-a-power-of-2 function
* Use `append_isnt_arg` instead of doing the same thing manually
* Fix grammar in comment
* Add `Signature::uses_special_{param,return}` helper functions
* Inline the definition of `legalize_type_for_sret_load` for readability
* Move sret legalization debug assertions out into their own function
* Add `round_up_to_multiple_of_type_align` helper for readability
* Add a debug assertion that we aren't removing the wrong return value
* Rename `RetPtr` stack slots to `StructReturnSlot`
* Make `legalize_type_for_sret_store` more symmetrical to `legalized_type_for_sret`
* rustfmt
* Remove unnecessary loop labels
* Do not pre-assign offsets to struct return stack slots

  Instead, let the existing frame layout algorithm decide where they should go.

* Expand "sret" into explicit "struct return" in doc comment
* typo: "than" -> "then" in comment
* Fold test's debug message into the assertion itself
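The struct-return legalization described in that commit can be modeled in ordinary Rust (an illustrative sketch only; the names `callee_sret` and `caller` are invented, and nothing here is Cranelift code):

```rust
// Legalized callee: instead of returning two values in registers, the
// "return values" become stores through a caller-provided sret pointer.
fn callee_sret(sret: &mut [i64; 2], a: i64, b: i64) {
    sret[0] = a + b;
    sret[1] = a - b;
}

fn caller() -> (i64, i64) {
    // The caller allocates the sret slot in its own stack frame...
    let mut slot = [0i64; 2];
    // ...passes a pointer to it into the callee...
    callee_sret(&mut slot, 10, 4);
    // ...and loads the return values back out after the call.
    (slot[0], slot[1])
}
```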
Because Wasm SIMD vectors store their type as `v128`, there is a mismatch between the more specific types Cranelift uses and Wasm SIMD. Because of this mismatch, Wasm SIMD translates to the default Cranelift type `I8X16`, causing issues when more specific type information is available (e.g. `I32X4`). To fix this, all incoming values to SIMD instructions are checked during translation (not at runtime) and, if necessary, cast from `I8X16` to the appropriate type by functions like `optionally_bitcast_vector`, `pop1_with_bitcast`, and `pop2_with_bitcast`. However, there are times when we must also cast outgoing values to `I8X16`, as with `local.set` and `local.tee`. There are other ways of resolving this (e.g. adding a new vector type, bytecodealliance/cranelift#1251) but we discussed staying with this casting approach in bytecodealliance#1147.
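A minimal model of that strategy: a cast is inserted only when an incoming value's type differs from the one the instruction needs. The `VType` enum and `casts_needed` helper are assumptions for illustration, not the real `cranelift-wasm` API:

```rust
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum VType {
    I8X16, // the default type v128 values are handed around as
    I32X4,
    F32X4,
}

// Count how many raw_bitcasts a consumer wanting `wanted` would insert
// for a list of incoming value types: only mismatched types need a cast,
// mirroring the on-demand behavior of `optionally_bitcast_vector`.
fn casts_needed(incoming: &[VType], wanted: VType) -> usize {
    incoming.iter().filter(|&&t| t != wanted).count()
}
```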
Previously, the logic was wrong on two counts:
- It used the bits of the entire vector (e.g. i32x4 -> 128) instead of just the lane bits (e.g. i32x4 -> 32).
- It used the type of the first operand before it was bitcast to its correct type.

Remember that, by default, vectors are handed around as i8x16 and we must bitcast them to their correct type for Cranelift's verifier; see bytecodealliance#1147 for discussion on this. This fix simply uses the type of the instruction itself, which is equivalent and hopefully less fragile to any changes.
* Ensure GlobalSet on vectors is cast to Cranelift's I8X16 type

  This is a fix related to the decision to use Cranelift's I8X16 type to represent Wasm's V128; it requires casting to maintain type correctness. See #1147.

* Enable SIMD spec test: simd_lane.wast
It seems to me that the simplest fix is simply to remove all […]
(notes copied from #2303): Disadvantages of using bitcasts: […]
Replacing […]
One longer-term way around that, once the old backend is no longer needed, would be to abandon the DSL for defining CLIF instructions and instead define them using a simple Rust enum, in the same way that the new backends define target-specific instructions. For the vector instructions, include a field of type […]
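That proposal could look something like the following (a hypothetical sketch; the `Inst` and `LaneType` names are invented, and real CLIF instructions carry far more information):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum LaneType {
    I8, I16, I32, I64, F32, F64,
}

// A plain Rust enum of instructions, in the style the new backends use
// for machine instructions. Vector instructions carry an explicit lane
// type, so the interpretation of the 128-bit operands is part of the
// instruction itself and no raw_bitcast is needed to recover it.
#[allow(dead_code)]
enum Inst {
    // Scalar add: the type comes from its operands.
    Iadd { a: u32, b: u32 },
    // Vector add: the lane type is an explicit field.
    VecIadd { lanes: LaneType, a: u32, b: u32 },
}
```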
Currently, many instructions are both scalar and vector instructions at the same time. For example, […]
What is the feature or code improvement you would like to do in Cranelift?
During translation from Wasm to CLIF, a combination of Wasm's `v128` type and Cranelift's current type system forces us to add many `raw_bitcast` instructions between operations. For example, this Wasm code: […] translates to this CLIF code: […]

This issue is to discuss if and how to remove these extra bitcasts.
What is the value of adding this in Cranelift?

The extra `raw_bitcast`s emit no machine code, but they are confusing when troubleshooting and add extra memory and processing overhead during compilation.

Do you have an implementation plan, and/or ideas for data structures or algorithms to use?
Some options:

1. Add types to `load` and `const`: WebAssembly/simd#125 (concerns about integer vs. floating-point instructions on x86) was discussed in the Wasm SIMD sync meeting (agenda: WebAssembly/simd#121), and someone brought up that making `load` and `const` typed (e.g. `f32x4.load`) would allow compilers to attach the correct types to values and retain them through the less strongly typed `v128` operations (e.g. `xor`). WebAssembly/simd#125 discusses this from a performance point of view, but that addition would also solve this issue.

2. Examine the DFG: another approach would be to look at the DFG to figure out the types of predecessors, as mentioned in WebAssembly/simd#1 (comment). This, however, would have to be extended for type signatures: Cranelift would have to look at the instructions in a function to figure out how the `v128` parameters are used. In the function `add-sub` above, with signature `(param v128 v128 v128)`, the addition and subtraction make this clear, but some functions will make this analysis impossible.

3. Add a `V128` type to Cranelift: Cranelift's type system could be extended to include a `V128` type that would include all the `INxN`, `FNxN`, and `BNxN` types. The instruction types would stay the same (e.g. `iadd` should still only accept integers), but type-checking could be relaxed to allow the `V128` type to be used as one of its valid subtypes. This opens up a mechanism to get around the type-checking, but arguably that already exists with `raw_bitcast`. Code that knows its types would remain as-is, but Wasm-to-CLIF translated code could use `V128` a bit more naturally than the `raw_bitcast`s.

4. Do nothing: I brought this up a long time ago when talking to @sunfishcode and that seemed the best thing to do then; I'm opening this issue to discuss whether that is still the case.
Have you considered alternative implementations? If so, how are they better or worse than your proposal?
See above.