Floating-point rounding instructions #232
No preferences for prototyping, we can probably squeeze them into
No strong preferences either. It's somewhat awkward, but we could also use something in the 0xc2-0xca range if contiguous opcodes make this simpler, because I don't see the 64x2 AnyTrue/AllTrue and the widening/narrowing instructions being relevant for 64x2 operations going forward. If we do have to spill over, it's not terrible, but we can make that call when we decide to move past prototyping.
These instructions aren't optional IMO. They're fundamental operations, and having to emulate them will be quite painful for many SIMD/SPMD kernels and vectorized math functions. For example, I have a Perlin noise kernel that computes 24 floors per output pixel. I also have a vectorized approximate math library that computes tan, sin, cos, log, exp, etc., and it uses floor and round for range reduction. Without efficient round/floor/trunc, WebAssembly SIMD will be in the same position SSE2 is in relative to SSE4.1: when we execute kernels on SSE2, we commonly see a 15-20% reduction in performance on some kernels, or on kernels that call sin/cos/tan/etc., because round/floor/trunc have to be emulated. These are very important operations. I am currently porting CppSPMD_Fast to WebAssembly, and the lack of efficient round/floor/trunc is going to hurt some kernels by quite a bit. I should have it up and running in 2-3 days.
It is worth noting that the common way to emulate round/floor/trunc involves converting back and forth to integers (this is application-dependent, since it assumes a specific input range, and it is typically not IEEE-compliant for some operations); however, due to #173 this workaround is going to be slow. If the inputs are known to be within a 23-bit integer range or thereabouts, floating-point addition can be abused to round, and it's probably possible to implement floor etc. in a similar fashion, but that doesn't seem like a route we would want to recommend.
Worth noting that this stops working if FP rules are relaxed (e.g., with fast-math compiler options).
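The addition trick mentioned above can be sketched in C as follows. This is a minimal illustration, not code from the thread: it assumes IEEE binary32 `float`, the default round-to-nearest-even mode, and inputs with |x| < 2^23, and the function names are hypothetical. Adding 2^23 (with the sign of x) moves the value into a binade where consecutive floats are exactly 1.0 apart, so the addition itself performs the rounding.

```c
/* Round to nearest (ties to even) by abusing float addition.
   Assumes |x| < 2^23 and the default round-to-nearest-even mode. */
float round_by_add_f32(float x) {
    /* In [2^23, 2^24) consecutive floats are exactly 1.0 apart. */
    const float magic = 8388608.0f; /* 2^23 */
    const float c = (x < 0.0f) ? -magic : magic;
    /* volatile forces the sum to be rounded to float and stops the compiler
       from simplifying (x + c) - c back to x. */
    volatile float t = x + c;
    return t - c; /* exact; note that -0.3f comes back as +0.0f, not -0.0f */
}

/* floor() built on the same trick: undo a rounding that went upward. */
float floor_by_add_f32(float x) {
    float r = round_by_add_f32(x);
    return (r > x) ? r - 1.0f : r;
}
```

The sketch also shows why this route is fragile: the sign of zero is lost, inputs outside the assumed range give wrong results, and a compiler allowed to reassociate floating-point math may fold `(x + c) - c` back to `x` if nothing like the `volatile` above prevents it.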
@Maratyszcza, any suggestions for an ARMv7 instruction sequence? Will it look a lot like the x86 SSE2 one?
SIMD equivalents of the nearest/trunc/ceil/floor instructions
Updated opcodes post-renumbering; they are now in the 0xd8-0xdf range.
Mapping to SSE2 is finished. @ngzhian ARMv7 NEON is quite different, because of its unique features:
Added ARMv7 NEON mapping for
There's some magic going on there. Thanks Marat!
All instruction mappings are finished, and the PR is ready for review.
It would be good to change the order of instructions to be consistent with their corresponding MVP instructions.
Co-authored-by: Thomas Lively <[email protected]>
As specified in WebAssembly/simd#232.
Thanks @Maratyszcza for filing the issues. Moving this to prototyping: on all the platforms we currently use as a baseline, these have a direct mapping to instructions. On ARMv7 there is precedent for them being slow, since the same is true of the scalar versions of these operations, and some implementations call out to the runtime to implement them. Moving to pending prototype data as we are prototyping them in V8; this is a retroactive label update.
Summary: As specified in WebAssembly/simd#232. These instructions are implemented as LLVM intrinsics for now rather than normal ISel patterns to make these instructions opt-in. Once the instructions are merged to the spec proposal, the intrinsics will be replaced with proper ISel patterns. Reviewers: aheejin Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits Tags: #clang, #llvm Differential Revision: https://reviews.llvm.org/D81222
These will be available in the next version of Emscripten via
Prototype in V8 is done for x64, ia32, ARM64. Still working on ARM. |
Implement some of the experimental SIMD opcodes that are supported by all of V8, LLVM, and Binaryen, for maximum compatibility with test content we might be exposed to. Most/all of these will probably make it into the spec, as they lead to substantial speedups in some programs, and they are deterministic. For spec and cpu mapping details, see: WebAssembly/simd#122 (pmax/pmin) WebAssembly/simd#232 (rounding) WebAssembly/simd#127 (dot product) WebAssembly/simd#237 (load zero) The wasm bytecode values used here come from the binaryen changes that are linked from those tickets, that's the best documentation right now. Current binaryen opcode mappings are here: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h Also: Drive-by fix for signatures of vroundss and vroundsd, these are unary operations and should follow the conventions for these with src/dest arguments, not src0/src1/dest. Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that take Operand source and FloatRegister destination. Differential Revision: https://phabricator.services.mozilla.com/D85982 UltraBlame original commit: 2d73a015caaa3e70c175172158a6548625dc6da3
This has been accepted into the proposal [0] during the sync on 2020-09-04. This LGTM as is. Note: I would like https://github.com/WebAssembly/simd/blob/master/proposals/simd/NewOpcodes.md to be updated too, but that requires more tweaks (there is a bit of a collision between the opcodes for these instructions and the "reserved" ones under i64x2, and the ordering of instructions for presentation also needs attention). That's not a big problem, though, and can be worked on in the future. [0] https://docs.google.com/document/d/138cF6aOUa9RZC2tOR7AhlIQWdmX5EtpzXRTVDAN3bfo/edit# see "4. Floating point rounding"
Implement f32x4 and f64x2 nearest, trunc, ceil, and floor. These instructions were accepted into the proposal [0], this change removes all the ifdefs and todo guarding the prototypes, and moves these instructions out of the post-mvp flag. [0] WebAssembly/simd#232 Bug: v8:10906 Change-Id: I44ec21dd09f3bf7cf3cae5d35f70f9d2c178c4e4 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2406547 Commit-Queue: Zhi An Ng <[email protected]> Reviewed-by: Bill Budge <[email protected]> Cr-Commit-Position: refs/heads/master@{#69923}
Port 068cf20 Original Commit Message: Implement f32x4 and f64x2 nearest, trunc, ceil, and floor. These instructions were accepted into the proposal [0], this change removes all the ifdefs and todo guarding the prototypes, and moves these instructions out of the post-mvp flag. [0] WebAssembly/simd#232 [email protected], [email protected], [email protected], [email protected] BUG= LOG=N Change-Id: I02086255f635f1d47586fc74dd754426f6beccb0 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2411675 Reviewed-by: Milad Farazmand <[email protected]> Reviewed-by: Junliang Yan <[email protected]> Commit-Queue: Milad Farazmand <[email protected]> Cr-Commit-Position: refs/heads/master@{#69925}
…status. r=jseward Background: WebAssembly/simd#232 For all the rounding SIMD instructions: - remove the internal 'Experimental' opcode suffix in the C++ code - remove the guard on experimental Wasm instructions in all the C++ decoders - move the test cases from simd/experimental.js to simd/ad-hack.js I have checked that current V8 and wasm-tools use the same opcode mappings. V8 in turn guarantees the correct mapping for LLVM and binaryen. Drive-by bug fix: the test predicate for f64 square root was wrong, it would round its argument to float. This did not matter for the test inputs we had but started to matter when I added more difficult inputs for testing rounding. Differential Revision: https://phabricator.services.mozilla.com/D92926
…structions This patch implements, for aarch64, the following wasm SIMD extensions Floating-point rounding instructions WebAssembly/simd#232 Pseudo-Minimum and Pseudo-Maximum instructions WebAssembly/simd#122 The changes are straightforward: * `build.rs`: the relevant tests have been enabled * `cranelift/codegen/meta/src/shared/instructions.rs`: new CLIF instructions `fmin_pseudo` and `fmax_pseudo`. The wasm rounding instructions do not need any new CLIF instructions. * `cranelift/wasm/src/code_translator.rs`: translation into CLIF; this is pretty much the same as any other unary or binary vector instruction (for the rounding and the pmin/max respectively) * `cranelift/codegen/src/isa/aarch64/lower_inst.rs`: - `fmin_pseudo` and `fmax_pseudo` are converted into a two instruction sequence, `fcmpgt` followed by `bsl` - the CLIF rounding instructions are converted to a suitable vector `frint{n,z,p,m}` instruction. * `cranelift/codegen/src/isa/aarch64/inst/mod.rs`: minor extension of `pub enum VecMisc2` to handle the rounding operations. And corresponding `emit` cases.
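The commit above lowers `fmin_pseudo`/`fmax_pseudo` on AArch64 to a compare (`fcmpgt`) followed by a bitwise select (`bsl`). The C sketch below models that two-step pattern one lane at a time. It assumes the WebAssembly pseudo-min/max definitions from WebAssembly/simd#122, pmin(a, b) = b < a ? b : a and pmax(a, b) = a < b ? b : a; the `f32x4` struct and all function names are hypothetical, and the operand order of the real Cranelift lowering is not spelled out in the commit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-lane container used only for this illustration. */
typedef struct { float lane[4]; } f32x4;

static uint32_t bits_of(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float float_of(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* Models BSL: bits of `t` where `mask` is 1, bits of `f` elsewhere. */
static float bsl(uint32_t mask, float t, float f) {
    return float_of((bits_of(t) & mask) | (bits_of(f) & ~mask));
}

/* Models one lane of FCMGT: all-ones when a > b, zero otherwise (including NaN). */
static uint32_t fcmgt(float a, float b) { return (a > b) ? 0xFFFFFFFFu : 0u; }

/* pmin(a, b) = b < a ? b : a */
f32x4 f32x4_pmin(f32x4 a, f32x4 b) {
    f32x4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = bsl(fcmgt(a.lane[i], b.lane[i]), b.lane[i], a.lane[i]);
    return r;
}

/* pmax(a, b) = a < b ? b : a */
f32x4 f32x4_pmax(f32x4 a, f32x4 b) {
    f32x4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = bsl(fcmgt(b.lane[i], a.lane[i]), b.lane[i], a.lane[i]);
    return r;
}
```

Because the select is driven by a single ordered comparison, a NaN or a pair of zeros with different signs simply makes the first operand win, which is what distinguishes these from the IEEE-style `min`/`max`.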
@tlively this wasn't added to NewOpcodes.md, just fyi in case you are looking at that doc for opcode organization.
Oh, thanks for pointing that out. I had indeed missed them.
Introduction
Floating-point round-to-integer is a widely used operation, available in many software and hardware specifications:
- `f32.nearest`/`f32.trunc`/`f32.ceil`/`f32.floor`/`f64.nearest`/`f64.trunc`/`f64.ceil`/`f64.floor` scalar instructions in WebAssembly
- `rint`/`nearbyint`/`trunc`/`ceil`/`floor` functions in C and C++
- `ROUNDPS` and `ROUNDPD` instructions in SSE4.1
- `VRINTN`/`VRINTZ`/`VRINTP`/`VRINTM` instructions in ARMv8 AArch32
- `FRINTN`/`FRINTZ`/`FRINTP`/`FRINTM` instructions in AArch64

This PR introduces the rounding instructions in WebAssembly SIMD.
New instructions
- `f32x4.nearest`/`f64x2.nearest`
- `f32x4.trunc`/`f64x2.trunc`
- `f32x4.ceil`/`f64x2.ceil`
- `f32x4.floor`/`f64x2.floor`

The instructions match the scalar WebAssembly analogs both in names and in semantics.
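Because every lane is rounded independently with the scalar semantics, the new instructions can be modeled as a scalar rounding function applied to each lane. The C sketch below is a hypothetical reference model (the `f32x4` struct and all function names are illustrative, not a real API). It assumes IEEE binary32 `float`, truncates by clearing the fractional mantissa bits, derives floor and ceil from the truncated value, and resolves nearest ties to even, as the scalar `f32.nearest` does.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-lane container used only for this illustration. */
typedef struct { float lane[4]; } f32x4;

static uint32_t to_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float from_bits(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* Round toward zero by clearing the fractional mantissa bits. */
static float trunc_f32(float x) {
    uint32_t u = to_bits(x);
    int e = (int)((u >> 23) & 0xFFu) - 127;         /* unbiased exponent */
    if (e >= 23) return x;                          /* already integral, or Inf/NaN */
    if (e < 0) return from_bits(u & 0x80000000u);   /* |x| < 1: zero with the sign of x */
    return from_bits(u & ~(0x007FFFFFu >> e));      /* clear the 23 - e fraction bits */
}

static float floor_f32(float x) {
    float t = trunc_f32(x);
    return (x < t) ? t - 1.0f : t;   /* only negative non-integers move down */
}

static float ceil_f32(float x) {
    float t = trunc_f32(x);
    return (x > t) ? t + 1.0f : t;   /* only positive non-integers move up */
}

static float nearest_f32(float x) {
    float t = trunc_f32(x);
    float frac = x - t;                        /* exact: the fractional part of x */
    float afrac = (frac < 0.0f) ? -frac : frac;
    float step = (x < 0.0f) ? -1.0f : 1.0f;    /* direction away from zero */
    if (afrac > 0.5f) return t + step;
    if (afrac == 0.5f && ((int32_t)t % 2) != 0) return t + step; /* tie: pick the even neighbour */
    return t;                                  /* keeps -0.0 for small negative x */
}

static f32x4 map_lanes(f32x4 v, float (*op)(float)) {
    for (int i = 0; i < 4; i++) v.lane[i] = op(v.lane[i]);
    return v;
}

f32x4 f32x4_nearest(f32x4 v) { return map_lanes(v, nearest_f32); }
f32x4 f32x4_trunc(f32x4 v)   { return map_lanes(v, trunc_f32); }
f32x4 f32x4_ceil(f32x4 v)    { return map_lanes(v, ceil_f32); }
f32x4 f32x4_floor(f32x4 v)   { return map_lanes(v, floor_f32); }
```

For example, with lanes {2.5, -2.5, 0.7, -0.3} this model gives nearest = {2, -2, 1, -0.0}, floor = {2, -3, 0, -1}, and ceil = {3, -2, 1, -0.0}.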
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
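The immediate operand in the AVX and SSE4.1 patterns selects the rounding behavior. Per the Intel SDM, bits 1:0 choose the mode (00 nearest-even, 01 toward -inf, 10 toward +inf, 11 toward zero), bit 2 set means "use MXCSR.RC instead", and bit 3 set suppresses the precision (inexact) exception, which WebAssembly does not expose anyway. The C sketch below decodes and encodes that byte; the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* imm8[1:0] rounding-mode field of ROUNDPS/ROUNDPD (and VROUNDPS/VROUNDPD). */
typedef enum {
    RC_NEAREST = 0, /* round to nearest, ties to even */
    RC_DOWN    = 1, /* toward -infinity */
    RC_UP      = 2, /* toward +infinity */
    RC_TRUNC   = 3  /* toward zero */
} round_mode;

typedef struct {
    round_mode mode;       /* imm8[1:0] */
    bool use_mxcsr;        /* imm8[2]: if set, MXCSR.RC overrides imm8[1:0] */
    bool suppress_inexact; /* imm8[3]: if set, the precision exception is not signaled */
} round_imm;

round_imm decode_round_imm(uint8_t imm) {
    round_imm r;
    r.mode = (round_mode)(imm & 0x3u);
    r.use_mxcsr = (imm & 0x4u) != 0;
    r.suppress_inexact = (imm & 0x8u) != 0;
    return r;
}

/* Immediate for a fixed rounding mode with the precision exception suppressed. */
uint8_t encode_round_imm(round_mode mode) {
    return (uint8_t)(0x08u | ((unsigned)mode & 0x3u));
}
```

Under these assumptions, `encode_round_imm` yields 0x08, 0x09, 0x0A, and 0x0B for `RC_NEAREST`, `RC_DOWN`, `RC_UP`, and `RC_TRUNC`, which matches the immediates used for nearest, floor, ceil, and trunc.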
x86/x86-64 processors with AVX instruction set
- `y = f32x4.nearest(x)` is lowered to `VROUNDPS xmm_y, xmm_x, 0x08`
- `y = f32x4.trunc(x)` is lowered to `VROUNDPS xmm_y, xmm_x, 0x0B`
- `y = f32x4.ceil(x)` is lowered to `VROUNDPS xmm_y, xmm_x, 0x0A`
- `y = f32x4.floor(x)` is lowered to `VROUNDPS xmm_y, xmm_x, 0x09`
- `y = f64x2.nearest(x)` is lowered to `VROUNDPD xmm_y, xmm_x, 0x08`
- `y = f64x2.trunc(x)` is lowered to `VROUNDPD xmm_y, xmm_x, 0x0B`
- `y = f64x2.ceil(x)` is lowered to `VROUNDPD xmm_y, xmm_x, 0x0A`
- `y = f64x2.floor(x)` is lowered to `VROUNDPD xmm_y, xmm_x, 0x09`

x86/x86-64 processors with SSE4.1 instruction set
- `y = f32x4.nearest(x)` is lowered to `ROUNDPS xmm_y, xmm_x, 0x08`
- `y = f32x4.trunc(x)` is lowered to `ROUNDPS xmm_y, xmm_x, 0x0B`
- `y = f32x4.ceil(x)` is lowered to `ROUNDPS xmm_y, xmm_x, 0x0A`
- `y = f32x4.floor(x)` is lowered to `ROUNDPS xmm_y, xmm_x, 0x09`
- `y = f64x2.nearest(x)` is lowered to `ROUNDPD xmm_y, xmm_x, 0x08`
- `y = f64x2.trunc(x)` is lowered to `ROUNDPD xmm_y, xmm_x, 0x0B`
- `y = f64x2.ceil(x)` is lowered to `ROUNDPD xmm_y, xmm_x, 0x0A`
- `y = f64x2.floor(x)` is lowered to `ROUNDPD xmm_y, xmm_x, 0x09`

x86/x86-64 processors with SSE2 instruction set
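The f32x4 sequences for SSE2 lean on a property of `CVTTPS2DQ`/`CVTPS2DQ`: a lane that is NaN or does not fit in int32 converts to 0x80000000 (the "integer indefinite" value). The `PCMPEQD`/`POR` pair turns that into a per-lane mask that is all-ones for such lanes (which are already integral, infinite, or NaN, so x passes through) and only the sign bit otherwise (so the result keeps the sign of x, e.g. trunc(-0.3) = -0.0). The C sketch below replays the `f32x4.trunc` sequence for a single lane with scalar bit operations; it is an illustrative model assuming IEEE binary32 `float`, and the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

static uint32_t f32_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float f32_from_bits(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* One lane of CVTTPS2DQ: truncating float -> int32 conversion that yields
   0x80000000 ("integer indefinite") for NaN and out-of-range inputs. */
static int32_t cvttps2dq_lane(float x) {
    if (!(x >= -2147483648.0f && x < 2147483648.0f)) return INT32_MIN; /* NaN fails both tests */
    return (int32_t)x; /* in range, so C's truncating conversion is well defined */
}

/* One lane of the SSE2 f32x4.trunc sequence, step by step. */
float trunc_sse2_lane(float x) {
    const uint32_t sign = 0x80000000u;                /* MOVDQA tmp0, splat(0x80000000) */
    int32_t i = cvttps2dq_lane(x);                    /* CVTTPS2DQ y, x */
    float converted = (float)i;                       /* CVTDQ2PS tmp1, y */
    uint32_t mask = ((uint32_t)i == sign) ? 0xFFFFFFFFu : 0u; /* PCMPEQD y, tmp0 */
    mask |= sign;                                     /* POR y, tmp0 */
    float x_copy = f32_from_bits(sign) + x;           /* ADDPS tmp0, x: -0.0f + x yields x (an sNaN would be quieted) */
    uint32_t keep_x = f32_bits(x_copy) & mask;        /* ANDPS tmp0, y */
    uint32_t keep_cvt = f32_bits(converted) & ~mask;  /* ANDNPS y, tmp1 */
    return f32_from_bits(keep_x | keep_cvt);          /* ORPS y, tmp0 */
}
```

The `f32x4.nearest` sequence has the same shape with `CVTPS2DQ`, which rounds according to MXCSR (round-to-nearest-even by default), in place of the truncating conversion.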
- `y = f32x4.nearest(x)` (`y` is NOT `x`) is lowered to:
  - `MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)`
  - `CVTPS2DQ xmm_y, xmm_x`
  - `CVTDQ2PS xmm_tmp1, xmm_y`
  - `PCMPEQD xmm_y, xmm_tmp0`
  - `POR xmm_y, xmm_tmp0`
  - `ADDPS xmm_tmp0, xmm_x`
  - `ANDPS xmm_tmp0, xmm_y`
  - `ANDNPS xmm_y, xmm_tmp1`
  - `ORPS xmm_y, xmm_tmp0`
- `y = f32x4.trunc(x)` (`y` is NOT `x`) is lowered to:
  - `MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)`
  - `CVTTPS2DQ xmm_y, xmm_x`
  - `CVTDQ2PS xmm_tmp1, xmm_y`
  - `PCMPEQD xmm_y, xmm_tmp0`
  - `POR xmm_y, xmm_tmp0`
  - `ADDPS xmm_tmp0, xmm_x`
  - `ANDPS xmm_tmp0, xmm_y`
  - `ANDNPS xmm_y, xmm_tmp1`
  - `ORPS xmm_y, xmm_tmp0`
- `x = f32x4.ceil(x)` is lowered to:
  - `CVTTPS2DQ xmm_tmp0, xmm_x`
  - `MOVDQA xmm_tmp1, wasm_splat_u32(0x80000000)`
  - `CVTDQ2PS xmm_tmp2, xmm_tmp0`
  - `PCMPEQD xmm_tmp0, xmm_tmp1`
  - `POR xmm_tmp0, xmm_tmp1`
  - `MOVDQA xmm_tmp3, xmm_tmp0`
  - `ANDPS xmm_tmp3, xmm_x`
  - `ANDNPS xmm_tmp0, xmm_tmp2`
  - `ORPS xmm_tmp0, xmm_tmp3`
  - `CMPLEPS xmm_x, xmm_tmp0`
  - `ORPS xmm_x, xmm_tmp1`
  - `MOVAPS xmm_tmp2, xmm_x`
  - `ANDPS xmm_tmp2, xmm_tmp0`
  - `ADDPS xmm_tmp0, wasm_splat_f32(1.0f)`
  - `ANDNPS xmm_x, xmm_tmp0`
  - `ORPS xmm_x, xmm_tmp2`
- `y = f32x4.floor(x)` (`y` is NOT `x`) is lowered to:
  - `MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)`
  - `CVTTPS2DQ xmm_y, xmm_x`
  - `CVTDQ2PS xmm_tmp1, xmm_y`
  - `PCMPEQD xmm_y, xmm_tmp0`
  - `POR xmm_y, xmm_tmp0`
  - `MOVAPS xmm_tmp0, xmm_y`
  - `ANDPS xmm_tmp0, xmm_x`
  - `ANDNPS xmm_y, xmm_tmp1`
  - `MOVAPS xmm_tmp1, xmm_x`
  - `ORPS xmm_y, xmm_tmp0`
  - `CMPLTPS xmm_tmp1, xmm_y`
  - `ANDPS xmm_tmp1, wasm_splat_f32(1.0f)`
  - `SUBPS xmm_y, xmm_tmp1`
- `y = f64x2.nearest(x)` (`y` is NOT `x`) is lowered to:
  - `MOVAPS xmm_tmp0, wasm_splat_f64(0x1.0p+52)`
  - `MOVAPS xmm_y, xmm_x`
  - `MOVAPS xmm_tmp1, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)`
  - `MOVAPS xmm_tmp2, xmm_tmp0`
  - `ANDPS xmm_y, xmm_tmp1`
  - `CMPLEPD xmm_tmp2, xmm_y`
  - `ADDPD xmm_y, xmm_tmp0`
  - `SUBPD xmm_y, xmm_tmp0`
  - `ANDNPS xmm_tmp2, xmm_tmp1`
  - `MOVAPS xmm_tmp1, xmm_tmp2`
  - `ANDNPS xmm_tmp1, xmm_x`
  - `ANDPS xmm_y, xmm_tmp2`
  - `ORPS xmm_y, xmm_tmp1`
- `y = f64x2.trunc(x)` (`y` is NOT `x`) is lowered to:
  - `MOVAPS xmm_y, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)`
  - `MOVAPS xmm_tmp0, wasm_splat_f64(0x1.0p+52)`
  - `MOVAPS xmm_tmp1, xmm_x`
  - `ANDPS xmm_tmp1, xmm_y`
  - `MOVAPS xmm_tmp2, xmm_tmp0`
  - `CMPNLEPD xmm_tmp2, xmm_tmp1`
  - `ANDPS xmm_y, xmm_tmp2`
  - `MOVAPS xmm_tmp2, xmm_tmp1`
  - `ADDPD xmm_tmp2, xmm_tmp0`
  - `SUBPD xmm_tmp2, xmm_tmp0`
  - `CMPLTPD xmm_tmp1, xmm_tmp2`
  - `ANDPS xmm_tmp1, wasm_splat_f64(1.0)`
  - `SUBPD xmm_tmp2, xmm_tmp1`
  - `ANDPS xmm_tmp2, xmm_y`
  - `ANDNPS xmm_y, xmm_x`
  - `ORPS xmm_y, xmm_tmp2`
- `y = f64x2.ceil(x)` (`y` is NOT `x`) is lowered to:
  - `MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)`
  - `MOVAPS xmm_y, xmm_x`
  - `MOVAPS xmm_tmp1, wasm_splat_f64(0x1.0p+52)`
  - `ANDPS xmm_y, xmm_tmp0`
  - `MOVAPS xmm_tmp2, xmm_tmp1`
  - `CMPNLEPD xmm_tmp2, xmm_y`
  - `ADDPD xmm_y, xmm_tmp1`
  - `ANDPS xmm_tmp2, xmm_tmp0`
  - `SUBPD xmm_y, xmm_tmp1`
  - `ANDPS xmm_y, xmm_tmp2`
  - `ANDNPS xmm_tmp2, xmm_x`
  - `ORPS xmm_tmp2, xmm_y`
  - `MOVAPS xmm_y, xmm_tmp2`
  - `MOVAPS xmm_tmp1, xmm_tmp2`
  - `CMPLTPD xmm_y, xmm_x`
  - `ADDPD xmm_tmp1, wasm_splat_f64(1.0)`
  - `ANDPS xmm_y, xmm_tmp0`
  - `ANDPS xmm_tmp1, xmm_y`
  - `ANDNPS xmm_y, xmm_tmp2`
  - `ORPS xmm_y, xmm_tmp1`
- `y = f64x2.floor(x)` (`y` is NOT `x`) is lowered to:
  - `MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)`
  - `MOVAPS xmm_tmp1, xmm_x`
  - `MOVAPS xmm_tmp2, wasm_splat_f64(0x1.0p+52)`
  - `ANDPS xmm_tmp1, xmm_tmp0`
  - `MOVAPS xmm_y, xmm_tmp2`
  - `CMPNLEPD xmm_y, xmm_tmp1`
  - `ANDPS xmm_y, xmm_tmp0`
  - `ADDPD xmm_tmp1, xmm_tmp2`
  - `SUBPD xmm_tmp1, xmm_tmp2`
  - `ANDPS xmm_tmp1, xmm_y`
  - `ANDNPS xmm_y, xmm_x`
  - `MOVAPS xmm_tmp0, xmm_x`
  - `ORPS xmm_y, xmm_tmp1`
  - `CMPLTPD xmm_tmp0, xmm_y`
  - `ANDPS xmm_tmp0, wasm_splat_f64(1.0)`
  - `SUBPD xmm_y, xmm_tmp0`

ARM64 processors
- `y = f32x4.nearest(x)` is lowered to `FRINTN Vy.4S, Vx.4S`
- `y = f32x4.trunc(x)` is lowered to `FRINTZ Vy.4S, Vx.4S`
- `y = f32x4.ceil(x)` is lowered to `FRINTP Vy.4S, Vx.4S`
- `y = f32x4.floor(x)` is lowered to `FRINTM Vy.4S, Vx.4S`
- `y = f64x2.nearest(x)` is lowered to `FRINTN Vy.2D, Vx.2D`
- `y = f64x2.trunc(x)` is lowered to `FRINTZ Vy.2D, Vx.2D`
- `y = f64x2.ceil(x)` is lowered to `FRINTP Vy.2D, Vx.2D`
- `y = f64x2.floor(x)` is lowered to `FRINTM Vy.2D, Vx.2D`

ARM processors with ARMv8 (32-bit) instruction set
- `y = f32x4.nearest(x)` is lowered to `VRINTN.F32 Qy, Qx`
- `y = f32x4.trunc(x)` is lowered to `VRINTZ.F32 Qy, Qx`
- `y = f32x4.ceil(x)` is lowered to `VRINTP.F32 Qy, Qx`
- `y = f32x4.floor(x)` is lowered to `VRINTM.F32 Qy, Qx`
- `y = f64x2.nearest(x)` is lowered to `VRINTN.F64 Dy_lo, Dx_lo` + `VRINTN.F64 Dy_hi, Dx_hi`
- `y = f64x2.trunc(x)` is lowered to `VRINTZ.F64 Dy_lo, Dx_lo` + `VRINTZ.F64 Dy_hi, Dx_hi`
- `y = f64x2.ceil(x)` is lowered to `VRINTP.F64 Dy_lo, Dx_lo` + `VRINTP.F64 Dy_hi, Dx_hi`
- `y = f64x2.floor(x)` is lowered to `VRINTM.F64 Dy_lo, Dx_lo` + `VRINTM.F64 Dy_hi, Dx_hi`

ARM processors with ARMv7 (32-bit) instruction set
- `y = f32x4.nearest(x)` (`y` is NOT `x`) is lowered to:
  - `VMOV.I32 Qtmp0, 0x4B000000`
  - `VABS.F32 Qtmp1, Qx`
  - `VACGT.F32 Qy, Qx, Qtmp0`
  - `VADD.F32 Qtmp1, Qtmp1, Qtmp0`
  - `VORR.I32 Qy, 0x80000000`
  - `VSUB.F32 Qtmp1, Qtmp1, Qtmp0`
  - `VBSL Qy, Qx, Qtmp1`
- `y = f32x4.trunc(x)` (`y` is NOT `x`) is lowered to:
  - `VCVT.S32.F32 Qtmp0, Qx`
  - `VMOV.I32 Qtmp1, 0x4B000000`
  - `VACGT.F32 Qy, Qtmp1, Qx`
  - `VCVT.F32.S32 Qtmp0, Qtmp0`
  - `VBIC.I32 Qy, 0x80000000`
  - `VBSL Qy, Qtmp0, Qx`
- `y = f32x4.ceil(x)` (`y` is NOT `x`) is lowered to:
  - `VCVT.S32.F32 Qtmp0, Qx`
  - `VMOV.I32 Qtmp1, 0x4B000000`
  - `VACGT.F32 Qtmp1, Qtmp1, Qx`
  - `VCVT.F32.S32 Qtmp0, Qtmp0`
  - `VBIC.I32 Qtmp1, 0x80000000`
  - `VBSL Qtmp1, Qtmp0, Qx`
  - `VMOV.F32 Qtmp0, 0x3F800000`
  - `VCGE.F32 Qy, Qtmp1, Qx`
  - `VADD.F32 Qtmp0, Qtmp1, Qtmp0`
  - `VORR.I32 Qy, 0x80000000`
  - `VBSL Qy, Qtmp1, Qtmp0`
- `y = f32x4.floor(x)` (`y` is NOT `x`) is lowered to:
  - `VCVT.S32.F32 Qtmp0, Qx`
  - `VMOV.I32 Qtmp1, 0x4B000000`
  - `VACGT.F32 Qy, Qtmp1, Qx`
  - `VCVT.F32.S32 Qtmp0, Qtmp0`
  - `VBIC.I32 Qy, 0x80000000`
  - `VBSL Qy, Qtmp0, Qx`
  - `VMOV.F32 Qtmp1, 0x3F800000`
  - `VCGT.F32 Qtmp0, Qy, Qx`
  - `VAND Qtmp0, Qtmp0, Qtmp1`
  - `VSUB.F32 Qy, Qy, Qtmp0`
- `y = f64x2.nearest(x)` (`y` is NOT `x`) is lowered to:
  - `VABS.F64 Dy_lo, Dx_lo`
  - `VABS.F64 Dy_hi, Dx_hi`
  - `VLDR Dtmp0, 0x1.0p+52`
  - `VSUB.F64 Dtmp1_lo, Dtmp0, Dy_lo`
  - `VSUB.F64 Dtmp1_hi, Dtmp0, Dy_hi`
  - `VADD.F64 Dtmp2_lo, Dy_lo, Dtmp0`
  - `VADD.F64 Dtmp2_hi, Dy_hi, Dtmp0`
  - `VEOR Qy, Qx, Qy`
  - `VSHR.S64 Qtmp1, Qtmp1, 63`
  - `VSUB.F64 Dtmp2_lo, Dtmp2_lo, Dtmp0`
  - `VSUB.F64 Dtmp2_hi, Dtmp2_hi, Dtmp0`
  - `VORR Qy, Qy, Qtmp1`
  - `VBSL Qy, Qx, Qtmp2`
- `y = f64x2.trunc(x)` (`y` is NOT `x`) is lowered to:
  - `VLDR Dtmp0, 0x1.0p+52`
  - `VABS.F64 Dy_lo, Dx_lo`
  - `VABS.F64 Dy_hi, Dx_hi`
  - `VADD.F64 Dtmp1_lo, Dy_lo, Dtmp0`
  - `VADD.F64 Dtmp1_hi, Dy_hi, Dtmp0`
  - `VSUB.F64 Dtmp2_lo, Dtmp0, Dy_lo`
  - `VSUB.F64 Dtmp2_hi, Dtmp0, Dy_hi`
  - `VEOR Qtmp3, Qy, Qx`
  - `VSUB.F64 Dtmp1_lo, Dtmp1_lo, Dtmp0`
  - `VSUB.F64 Dtmp1_hi, Dtmp1_hi, Dtmp0`
  - `VLDR Dtmp0, 1.0`
  - `VSHR.S64 Qtmp2, Qtmp2, 63`
  - `VORR Qtmp3, Qtmp3, Qtmp2`
  - `VSUB.I64 Qy, Qy, Qtmp1`
  - `VSHR.S64 Qy, Qy, 63`
  - `VAND Dy_lo, Dy_lo, Dtmp0`
  - `VAND Dy_hi, Dy_hi, Dtmp0`
  - `VSUB.F64 Dy_lo, Dtmp1_lo, Dy_lo`
  - `VSUB.F64 Dy_hi, Dtmp1_hi, Dy_hi`
  - `VBIT Qy, Qx, Qtmp3`
- `y = f64x2.ceil(x)` (`y` is NOT `x`) is lowered to:
  - `VLDR Dtmp0, 0x1.0p+52`
  - `VABS.F64 Dtmp1_lo, Dx_lo`
  - `VABS.F64 Dtmp1_hi, Dx_hi`
  - `VSUB.F64 Dtmp2_lo, Dtmp0, Dtmp1_lo`
  - `VSUB.F64 Dtmp2_hi, Dtmp0, Dtmp1_hi`
  - `VADD.F64 Dtmp3_lo, Dtmp1_lo, Dtmp0`
  - `VADD.F64 Dtmp3_hi, Dtmp1_hi, Dtmp0`
  - `VEOR Qtmp1, Qtmp1, Qx`
  - `VSHR.S64 Qtmp2, Qtmp2, 63`
  - `VSUB.F64 Dtmp3_lo, Dtmp3_lo, Dtmp0`
  - `VSUB.F64 Dtmp3_hi, Dtmp3_hi, Dtmp0`
  - `VLDR Dtmp0, 1.0`
  - `VORR Qtmp2, Qtmp2, Qtmp1`
  - `VBSL Qtmp2, Qx, Qtmp3`
  - `VSUB.F64 Dy_lo, Dtmp2_lo, Dx_lo`
  - `VSUB.F64 Dy_hi, Dtmp2_hi, Dx_hi`
  - `VADD.F64 Dtmp3_lo, Dtmp2_lo, Dtmp0`
  - `VADD.F64 Dtmp3_hi, Dtmp2_hi, Dtmp0`
  - `VSHR.S64 Qy, Qy, 63`
  - `VBIC Qy, Qy, Qtmp1`
  - `VBSL Qy, Qtmp3, Qtmp2`
- `y = f64x2.floor(x)` (`y` is NOT `x`) is lowered to:
  - `VLDR Dtmp0, 0x1.0p+52`
  - `VABS.F64 Dy_lo, Dx_lo`
  - `VABS.F64 Dy_hi, Dx_hi`
  - `VADD.F64 Dtmp1_lo, Dy_lo, Dtmp0`
  - `VADD.F64 Dtmp1_hi, Dy_hi, Dtmp0`
  - `VSUB.I64 Dtmp2_lo, Dtmp0, Dy_lo`
  - `VSUB.I64 Dtmp2_hi, Dtmp0, Dy_hi`
  - `VEOR Qy, Qy, Qx`
  - `VSUB.F64 Dtmp1_lo, Dtmp1_lo, Dtmp0`
  - `VSUB.F64 Dtmp1_hi, Dtmp1_hi, Dtmp0`
  - `VLDR Dtmp0, 1.0`
  - `VSHR.S64 Qtmp2, Qtmp2, 63`
  - `VORR Qy, Qy, Qtmp2`
  - `VBSL Qy, Qx, Qtmp1`
  - `VSUB.F64 Dx_lo, Dx_lo, Dy_lo`
  - `VSUB.F64 Dx_hi, Dx_hi, Dy_hi`
  - `VSHR.S64 Qtmp2, Qx, 63`
  - `VAND Dtmp2_lo, Dtmp2_lo, Dtmp0`
  - `VAND Dtmp2_hi, Dtmp2_hi, Dtmp0`
  - `VSUB.F64 Dy_lo, Dy_lo, Dtmp2_lo`
  - `VSUB.F64 Dy_hi, Dy_hi, Dtmp2_hi`
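The f64x2 sequences for SSE2 and ARMv7 cannot rely on a packed f64-to-int64 conversion, so they round with the 2^52 trick instead: adding and subtracting 0x1.0p+52 rounds |x| to an integer whenever |x| < 2^52, a sign-bit test on 2^52 - |x| builds a mask that passes through lanes with larger magnitude (already integral) and infinities, and the sign of x is copied back in. The C sketch below replays the ARMv7 sequence for rounding f64x2 to nearest on a single lane. It is an illustrative model that assumes IEEE binary64 `double` and the default round-to-nearest-even mode; the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

static uint64_t bits64(double x) { uint64_t u; memcpy(&u, &x, sizeof u); return u; }
static double from64(uint64_t u) { double x; memcpy(&x, &u, sizeof x); return x; }

/* One lane of the ARMv7 "round f64x2 to nearest" sequence. */
double nearest_armv7_lane(double x) {
    const double two52 = 4503599627370496.0;           /* VLDR Dtmp0, 0x1.0p+52 */
    uint64_t xb = bits64(x);
    double ax = from64(xb & 0x7FFFFFFFFFFFFFFFull);     /* VABS.F64: |x| */
    double diff = two52 - ax;                          /* VSUB.F64 tmp1 = 2^52 - |x| */
    volatile double sum = ax + two52;                  /* VADD.F64 tmp2: rounds |x| to an integer */
    uint64_t mask = xb ^ bits64(ax);                   /* VEOR: just the sign bit of x */
    /* VSHR.S64 #63 broadcasts the sign bit: all-ones if 2^52 - |x| < 0 (|x| > 2^52 or Inf). */
    mask |= (uint64_t)0 - (bits64(diff) >> 63);
    double r = sum - two52;                            /* VSUB.F64 tmp2: exact */
    return from64((xb & mask) | (bits64(r) & ~mask));  /* VORR + VBSL y, x, tmp2 */
}
```

A NaN input is not caught by the mask in this model, but it still produces NaN because the arithmetic path propagates it; for example, `nearest_armv7_lane(-0.4)` returns -0.0 and `nearest_armv7_lane(3.5)` returns 4.0.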