u32::from_be_bytes(*bytes) generates suboptimal code on riscv #88852
Comments
These are not slices, these are arrays, yes? Taking the array by value converts this example.
I'm not sure whether these are considered slices or arrays -- I'm a bit fuzzy on the terminology. Some background: this issue was spotted in some networking code. On RISC-V, taking the array by value is still less efficient than manually packing, because it adds two more instructions.
Of course, there's also the possibility that the implementation of bswap for that target isn't the fastest.
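For context, on a little-endian target such as riscv64, `u32::from_be_bytes` boils down to a native-endian load plus a byte swap (LLVM's `bswap`). A rough Rust equivalent, written here purely for illustration, would be:

```rust
// Rough illustration only: on a little-endian target, converting from
// big-endian amounts to reading the bytes natively and then swapping them.
pub fn from_be_via_swap(bytes: [u8; 4]) -> u32 {
    u32::from_ne_bytes(bytes).swap_bytes()
}
```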
I believe borrowing arrays can coerce them into slices in some cases, though I am always slightly fuzzy on the rules for that, but it is something to watch out for: an array generally has a known length and so is easier to optimize. Likewise, small byte counts like this should generally be taken by value and not by reference.
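As a rough illustration of why the distinction matters (the function names here are made up for the example): a borrowed array can coerce to a slice, which erases the compile-time length, whereas a by-value `[u8; 4]` keeps it.

```rust
// The slice version has lost the length in its type, so the conversion needs
// a runtime length check (via try_into, in the Rust 2021 prelude).
fn parse_slice(bytes: &[u8]) -> u32 {
    u32::from_be_bytes(bytes.try_into().expect("need exactly 4 bytes"))
}

// The by-value array version carries its length in the type, so no runtime
// check is needed.
fn parse_array(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}
```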
I should note that "two instructions" is not necessarily a significant performance hit. I am not super-familiar with the performance details of RISC-V architectures, but there is the usual issue of superscalar out-of-order execution leading to surprising, non-intuitive results, and for RISC-V at least some implementers advocate "macro-op fusion": at decode time the architecture is entitled to fuse two machine-assembly instructions and execute them as one machine-internal operation. So we should be very careful when we talk about "efficient": do we mean speed or binary size? On the large scale the two are often very similar, but not always at this tiny nanosecond scale. This is why I added …

When changing your version of the code to

```rust
pub fn from_be_array_manual(bytes: [u8; 4]) -> u32 {
    (bytes[0] as u32) << 24
        | ((bytes[1] as u32) << 16)
        | ((bytes[2] as u32) << 8)
        | (bytes[3] as u32)
}
```

it emits identical code to

```rust
pub fn from_be_array_intrinsic(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}
```

I think this issue still matters, as the case with the indirection is unexpectedly heavy: either rustc or LLVM should see that we do a copy anyway and thus make everything more transparent. Since you mentioned this is an issue in a larger codebase, I would like to know whether you have a slightly more complex example showing that it does not optimize well in the hot path of some loop: that would help troubleshoot this greatly.
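To make that request concrete, a sketch of the kind of hot-loop example being asked for might look like the following (the function and buffer layout are invented for illustration, not taken from the reporter's codebase):

```rust
// Hypothetical hot loop: fold big-endian u32 words out of a packet buffer,
// going through the by-reference form of the intrinsic with the extra
// indirection that the issue is about.
pub fn checksum_words(buf: &[u8]) -> u32 {
    let mut acc = 0u32;
    for chunk in buf.chunks_exact(4) {
        // chunk is a &[u8] of length 4; convert it to a &[u8; 4] reference.
        let bytes: &[u8; 4] = chunk.try_into().expect("chunks_exact yields 4 bytes");
        acc = acc.wrapping_add(u32::from_be_bytes(*bytes));
    }
    acc
}
```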
Trying the basic bit manipulation target feature mentioned in #100528, some of these examples reduce in size.

```rust
pub fn from_be_array_intrinsic(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}
```

Without the target feature: …

With the target feature: …
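For anyone trying to reproduce this: the "basic bit manipulation" extension is Zbb, which, as far as I know, can be enabled with a compiler flag along these lines (the exact invocation used above isn't shown, so this is a guess at how to turn it on):

```
rustc -O --target riscv64gc-unknown-linux-gnu -C target-feature=+zbb --emit asm example.rs
```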
@xobs could you please check whether current rustc behaves as you expect, so we can close this?
Yes, this behaves as expected. Thank you for following up!
I tried this code:
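The snippet itself is only preserved in the Godbolt link below; judging from the issue title and the discussion above, it was presumably along these lines (a reconstruction for readability, not the original code):

```rust
// Reconstructed from the title: the intrinsic version takes the four bytes
// behind a reference and dereferences them...
pub fn from_be_intrinsic(bytes: &[u8; 4]) -> u32 {
    u32::from_be_bytes(*bytes)
}

// ...while the hand-written version packs the bytes manually.
pub fn from_be_manual(bytes: &[u8; 4]) -> u32 {
    (bytes[0] as u32) << 24
        | ((bytes[1] as u32) << 16)
        | ((bytes[2] as u32) << 8)
        | (bytes[3] as u32)
}
```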
I expected to see this happen: Both should produce equal output, or at the very least the one using the intrinsic should be acceptable. When building on x86 these produce the same output, and on `arm-unknown-linux-gnueabi` the output is different but not terrible. On `riscv64gc-unknown-linux-gnu` the asm generated by the intrinsic is massive.

Instead, this happened:
Meta

`rustc --version --verbose`:

Godbolt link: https://godbolt.org/z/aPPdnond5