nano-optimization for memchr::repeat_byte #50398
Conversation
r? @KodrAus (rust_highfive has picked a reviewer for you, use r? to override)
cc @Manishearth who wrote the original code
cc @BurntSushi, who actually wrote this code (I just copied it), and @jimblandy, whose C code this was originally adapted from
cc @bluss, who is the one who actually wrote the original fast memchr fallback implementation. :-) This does look good to me though! Cute trick.
tldr nobody wrote it and it just appeared |
The improvement here (if any) depends heavily on the relative instruction latency between bitwise ops and multiply. Most modern architectures have a fast multiply (throughput of 1+ instructions per cycle), but its latency is greater (3+ cycles). By contrast, bitwise instructions usually have a latency of 1 cycle (and a throughput of 2 to 4 instructions per cycle), so as long as there are no more than 3 bitwise instructions in the critical path, the bitwise code will be almost universally faster (there seem to be 4 for 32-bit targets). The multiply might be considerably worse on 32-bit architectures which do not have a multiply instruction at all, but I verified that we (or LLVM) do not support any 32+-bit targets which lack a native multiply instruction. I observed some backends, such as MIPS and Lanai, translate the multiply back into the original sequence of bitwise ops, presumably because it is more efficient to do so there. The x86 backend does the same in some cases as well, but, notably, ARM does not. With this in mind, it seems pretty safe to @bors r+
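For reference, here is a rough sketch of the two `repeat_byte` variants being compared. This is illustrative only: the function names are mine, the 64-bit shift chain is shown, and the exact code in `core::slice::memchr` may differ in detail.

```rust
// Old fallback: build the repeated-byte word with a chain of shifts and ORs.
// Each step depends on the previous one, so the ops cannot overlap in the pipeline.
#[cfg(target_pointer_width = "64")]
fn repeat_byte_shifts(b: u8) -> usize {
    let mut rep = b as usize;
    rep |= rep << 8;  // bytes 0..=1 now hold b
    rep |= rep << 16; // bytes 0..=3
    rep |= rep << 32; // bytes 0..=7
    rep
}

// New version: a single multiplication by 0x0101...01 broadcasts the byte
// into every byte of the word.
fn repeat_byte_mul(b: u8) -> usize {
    (b as usize) * (usize::MAX / 255)
}

fn main() {
    // Sanity check: both produce the same broadcast word on 64-bit targets.
    #[cfg(target_pointer_width = "64")]
    assert_eq!(repeat_byte_shifts(0x2a), repeat_byte_mul(0x2a));
}
```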
📌 Commit 1cefb5c has been approved by |
Ah, the pain of spotting a typo, but not wanting to confuse bors by editing the comment.
☀️ Test successful - status-appveyor, status-travis
This replaces the multiple shifts & bitwise ORs with a single multiplication.
In my benchmarks this performs equally well or better, especially on 64-bit systems (it shaves a stable nanosecond on my Skylake). This may go against conventional wisdom, but the shifts and bitwise ORs cannot be pipelined because of hard data dependencies.
While it may or may not be worthwhile from an optimization standpoint, it also reduces code size, so there's basically no downside.
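To illustrate why the single multiplication broadcasts the byte (a small standalone example, not part of the PR's diff): `usize::MAX / 255` is the word with `0x01` in every byte, so multiplying it by `b` yields `b` repeated in every byte.

```rust
fn main() {
    let b: u8 = 0x61; // 'a'
    let broadcast = (b as usize) * (usize::MAX / 255);
    // On a 64-bit target, usize::MAX / 255 == 0x0101_0101_0101_0101,
    // so the product is 0x6161_6161_6161_6161: the byte repeated eight times.
    #[cfg(target_pointer_width = "64")]
    assert_eq!(broadcast, 0x6161_6161_6161_6161);
    println!("{:#x}", broadcast);
}
```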