Filling short slices is slow even if they are provably short #71066
Labels: A-iterators, C-enhancement, I-slow, T-compiler
Take this method as baseline (playground):
padding_length can take on values in the range [0, 2], so we have to write either zero, one, or two bytes into our slice.
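The playground code is not reproduced here. As a rough sketch of a baseline of this shape: the function name `add_padding_indexed`, the `% 3` derivation of `padding_length`, and the `b'='` fill byte are all assumptions chosen for illustration, not necessarily what the linked playground contains.

```rust
/// Hypothetical baseline: fill the first `padding_length` bytes of `output`
/// using an indexed loop. The name, the `% 3` derivation and the `b'='`
/// fill byte are assumptions made for this sketch.
pub fn add_padding_indexed(input_len: usize, output: &mut [u8]) -> usize {
    // `(3 - rem) % 3` can only be 0, 1, or 2.
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;

    // One indexed write per iteration; each access is bounds-checked.
    for i in 0..padding_length {
        output[i] = b'=';
    }

    padding_length
}
```

Calling it with an `input_len` of 3, 5, and 4 exercises the zero, one, and two byte cases respectively.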
Benchmarking all three cases gives us these timings:
Chasing more idiomatic code we switch to iterators for the loop (playground):
Since this loop runs at most two iterations, there is little to gain from avoiding bounds checks, so we expect about the same runtime.
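For illustration, an iterator-based variant might look like the following. This is a sketch only: `add_padding_iter`, the `% 3` derivation, and the `b'='` fill byte are assumed names and values, not taken from the linked playground.

```rust
/// Hypothetical iterator variant: same result as an indexed loop, but the
/// bytes are written through `iter_mut()` over a sub-slice.
/// Name, `% 3` derivation and fill byte are assumptions for this sketch.
pub fn add_padding_iter(input_len: usize, output: &mut [u8]) -> usize {
    // `(3 - rem) % 3` can only be 0, 1, or 2.
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;

    // A single bounds check happens when slicing; the loop itself has none.
    for byte in output[..padding_length].iter_mut() {
        *byte = b'=';
    }

    padding_length
}
```

Functionally this is equivalent to writing the bytes by index; only the generated code is expected to differ.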
Oof, up to 68% slower...
Let's see what's happening
The baseline version copies one byte at a time, which isn't that bad when you copy at most two bytes:
In comparison the iterator version does a full memset:
For long slices memset would be a good choice, but for just a few bytes the overhead is simply too big. When using a constant range for testing, we see the compiler emitting different combinations of `movb`, `movw`, `movl`, `movabsq`, and `movaps`+`movups` up to a length of 256 bytes. Only for slices longer than that is a memset used.

At some point the compiler already realizes that `padding_length` is always `< 3`, as an `assert!(padding_length < 3);` gets optimized out completely. Whether this information is not available at the right place or is simply not utilized, I can't tell.

Wrapping the iterator version's loop in a `match` results in two things - the fastest version and a monstrosity (playground). It uses a `movw` when writing two bytes, which explains why this version is faster than baseline only in that case.

All measurements taken with criterion.rs and Rust 1.42.0 on an i5-3450. Care has been taken to ensure a low noise environment with reproducible results.