u32::from_be_bytes(*bytes) generates suboptimal code on riscv #88852
Comments
These are not slices, these are arrays, yes? Taking the array by value converts this example.
I'm not sure whether these are considered slices or arrays -- I'm a bit fuzzy on the terminology. Some background: this issue was spotted in some networking code. On RISC-V, taking the array by value is still less efficient than manually packing, because it adds two more instructions.
Of course, there's also the possibility that the implementation of bswap for that target isn't the fastest.
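For context, on a little-endian target such as riscv64, `u32::from_be_bytes` boils down to a native-endian load plus a byte swap (LLVM's `bswap`). A rough Rust equivalent, written here purely for illustration, would be:

```rust
// Rough illustration only: on a little-endian target, converting from
// big-endian amounts to reading the bytes natively and then swapping them.
pub fn from_be_via_swap(bytes: [u8; 4]) -> u32 {
    u32::from_ne_bytes(bytes).swap_bytes()
}
```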
I believe borrowing arrays can coerce them into slices in some cases, though I am always slightly fuzzy on the rules for that, but it is something to watch out for: an array generally has a known length and so is easier to optimize. Likewise, small byte counts like this should generally be taken by value and not by reference.
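As a rough illustration of why the distinction matters (the function names here are made up for the example): a borrowed array can coerce to a slice, which erases the compile-time length, whereas a by-value `[u8; 4]` keeps it.

```rust
// The slice version has lost the length in its type, so the conversion needs
// a runtime length check (via try_into, in the Rust 2021 prelude).
fn parse_slice(bytes: &[u8]) -> u32 {
    u32::from_be_bytes(bytes.try_into().expect("need exactly 4 bytes"))
}

// The by-value array version carries its length in the type, so no runtime
// check is needed.
fn parse_array(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}
```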
I should note that "two instructions" is not necessarily a significant performance hit. I am not super-familiar with the performance details of RISC-V architectures, but there is the usual issue of superscalar out-of-order execution leading to surprising, non-intuitive results, and for RISC-V at least some implementers advocate "macro-op fusion": at decode time the architecture is entitled to fuse two machine-assembly instructions and execute them as one machine-internal operation. So we should be very careful when we talk about "efficient": do we mean speed or binary size? On the large scale the two are often very similar, but not always at this tiny nanosecond scale. This is why I added …

When changing your version of the code to

```rust
pub fn from_be_array_manual(bytes: [u8; 4]) -> u32 {
    (bytes[0] as u32) << 24
        | ((bytes[1] as u32) << 16)
        | ((bytes[2] as u32) << 8)
        | (bytes[3] as u32)
}
```

it emits identical code to

```rust
pub fn from_be_array_intrinsic(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}
```

I think this issue still matters, as the case with the indirection is unexpectedly heavy: either rustc or LLVM should see that we do a copy anyway and thus make everything more transparent. Since you mentioned this is an issue in a larger codebase, I would like to know whether you have a slightly more complex example showing that it does not optimize well in the hot path of some loop: that would help troubleshoot this greatly.
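To make that request concrete, a sketch of the kind of hot-loop example being asked for might look like the following (the function and buffer layout are invented for illustration, not taken from the reporter's codebase):

```rust
// Hypothetical hot loop: fold big-endian u32 words out of a packet buffer,
// going through the by-reference form of the intrinsic with the extra
// indirection that the issue is about.
pub fn checksum_words(buf: &[u8]) -> u32 {
    let mut acc = 0u32;
    for chunk in buf.chunks_exact(4) {
        // chunk is a &[u8] of length 4; convert it to a &[u8; 4] reference.
        let bytes: &[u8; 4] = chunk.try_into().expect("chunks_exact yields 4 bytes");
        acc = acc.wrapping_add(u32::from_be_bytes(*bytes));
    }
    acc
}
```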
Trying the basic bit manipulation target feature mentioned in #100528, some of these examples reduce in size.

```rust
pub fn from_be_array_intrinsic(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}
```

Without the target feature: …

With the target feature: …
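For anyone trying to reproduce this: the "basic bit manipulation" extension is Zbb, which, as far as I know, can be enabled with a compiler flag along these lines (the exact invocation used above isn't shown, so this is a guess at how to turn it on):

```
rustc -O --target riscv64gc-unknown-linux-gnu -C target-feature=+zbb --emit asm example.rs
```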
@xobs could you please check whether current rustc behaves as you expect, so we can close this?
Yes, this behaves as expected. Thank you for following up!
I tried this code:
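The snippet itself is only preserved in the Godbolt link below; judging from the issue title and the discussion above, it was presumably along these lines (a reconstruction for readability, not the original code):

```rust
// Reconstructed from the title: the intrinsic version takes the four bytes
// behind a reference and dereferences them...
pub fn from_be_intrinsic(bytes: &[u8; 4]) -> u32 {
    u32::from_be_bytes(*bytes)
}

// ...while the hand-written version packs the bytes manually.
pub fn from_be_manual(bytes: &[u8; 4]) -> u32 {
    (bytes[0] as u32) << 24
        | ((bytes[1] as u32) << 16)
        | ((bytes[2] as u32) << 8)
        | (bytes[3] as u32)
}
```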
I expected to see this happen: Both should produce equal output, or at the very least the one using the intrinsic should be acceptable. When building on x86 these produce the same output, and on `arm-unknown-linux-gnueabi` the output is different but not terrible. On `riscv64gc-unknown-linux-gnu` the asm generated by the intrinsic is massive.

Instead, this happened:
Meta

`rustc --version --verbose`:

Godbolt link: https://godbolt.org/z/aPPdnond5