
Filling short slices is slow even if they are provably short #71066

Open
LingMan opened this issue Apr 12, 2020 · 0 comments
Labels
A-iterators (Area: Iterators)
C-enhancement (Category: An issue proposing an enhancement or a PR with one)
I-slow (Issue: Problems and improvements with respect to performance of generated code)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments


LingMan commented Apr 12, 2020

Take this method as a baseline (playground):

pub fn add_padding(input_len: usize, output: &mut [u8]) -> usize {
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;
    for i in 0..padding_length {
        output[i] = b'=';
    }

    padding_length
}

padding_length can take on values in the range [0, 2], so we have to write either zero, one, or two bytes into our slice.
Benchmarking all three cases gives us these timings:

0: time:   [2.8673 ns 2.8692 ns 2.8714 ns]
1: time:   [3.2384 ns 3.2411 ns 3.2443 ns]
2: time:   [3.7454 ns 3.7478 ns 3.7507 ns]
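
For reference, a minimal criterion harness along these lines measures the three cases (a sketch, not the exact setup used; it assumes add_padding is in scope):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_padding(c: &mut Criterion) {
    // input_len of 3, 2, and 1 yields padding_length 0, 1, and 2 respectively
    for &(name, input_len) in &[("pad 0", 3usize), ("pad 1", 2), ("pad 2", 1)] {
        c.bench_function(name, |b| {
            let mut output = [0u8; 3];
            b.iter(|| add_padding(black_box(input_len), black_box(&mut output)));
        });
    }
}

criterion_group!(benches, bench_padding);
criterion_main!(benches);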

In pursuit of more idiomatic code, we switch the loop over to iterators (playground):

pub fn add_padding(input_len: usize, output: &mut [u8]) -> usize {
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;
    for byte in output[..padding_length].iter_mut() {
        *byte = b'=';
    }

    padding_length
}

Given that this loop runs at most two iterations, there is little performance to gain from avoiding bounds checks, so we expect roughly the same runtime.

absolute:
0: time:   [3.2053 ns 3.2074 ns 3.2105 ns]
1: time:   [5.4453 ns 5.4475 ns 5.4501 ns]
2: time:   [6.0211 ns 6.0254 ns 6.0302 ns]

relative compared to baseline:
0: time:   [+11.561% +11.799% +12.030%]
1: time:   [+67.946% +68.287% +68.647%]
2: time:   [+60.241% +60.600% +60.932%]

Oof, up to 68% slower...

Let's see what's happening.

The baseline version copies one byte at a time, which isn't that bad when you copy at most two bytes:

playground::add_padding:
	pushq	%rax
	movq	%rdx, %rcx
	movabsq	$-6148914691236517205, %r8
	movq	%rdi, %rax
	mulq	%r8
	shrq	%rdx
	leaq	(%rdx,%rdx,2), %rax
	subq	%rax, %rdi
	xorq	$3, %rdi
	movq	%rdi, %rax
	mulq	%r8
	shrq	%rdx
	leaq	(%rdx,%rdx,2), %rax
	subq	%rax, %rdi
	je	.LBB0_4
	xorl	%eax, %eax

.LBB0_2:
	cmpq	%rax, %rcx
	je	.LBB0_5
	movb	$61, (%rsi,%rax)
	addq	$1, %rax
	cmpq	%rdi, %rax
	jb	.LBB0_2

.LBB0_4:
	movq	%rdi, %rax
	popq	%rcx
	retq

[snip panic code]

In comparison, the iterator version calls out to a full memset:

playground::add_padding:
	pushq	%rbx
	movq	%rdx, %rcx
	movq	%rdi, %rbx
	movabsq	$-6148914691236517205, %rdi
	movq	%rbx, %rax
	mulq	%rdi
	shrq	%rdx
	leaq	(%rdx,%rdx,2), %rax
	subq	%rax, %rbx
	xorq	$3, %rbx
	movq	%rbx, %rax
	mulq	%rdi
	shrq	%rdx
	leaq	(%rdx,%rdx,2), %rax
	subq	%rax, %rbx
	cmpq	%rcx, %rbx
	ja	.LBB0_4
	testq	%rbx, %rbx
	je	.LBB0_3
	movq	%rsi, %rdi
	movl	$61, %esi
	movq	%rbx, %rdx
	callq	*memset@GOTPCREL(%rip)

.LBB0_3:
	movq	%rbx, %rax
	popq	%rbx
	retq

[snip panic code]

For long slices a memset would be a good choice, but for just a few bytes the overhead is simply too big. When testing with a constant range, we see the compiler emitting different combinations of movb, movw, movl, movabsq, and movaps+movups up to a length of 256 bytes. Only for slices longer than that does it use a memset.
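
For instance, a fixed-length variant like the following compiles to inline stores rather than a memset call (an illustrative test, not code from the original crate):

// With a constant range the fill length is known at compile time, so the
// compiler emits a single two-byte store instead of calling memset.
pub fn fill_two(output: &mut [u8]) {
    for byte in output[..2].iter_mut() {
        *byte = b'=';
    }
}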

At some point the compiler already realizes that padding_length is always < 3, since an assert!(padding_length < 3); gets optimized out completely. Whether this information is unavailable in the right place or simply goes unused, I can't tell.
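
Concretely, adding the assert directly before the loop changes nothing in the generated code:

pub fn add_padding(input_len: usize, output: &mut [u8]) -> usize {
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;
    // This assert is optimized out entirely, so the bound is known here...
    assert!(padding_length < 3);
    // ...yet the loop below is still lowered to a memset call.
    for byte in output[..padding_length].iter_mut() {
        *byte = b'=';
    }

    padding_length
}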

Wrapping the iterator version's loop in a match results in two things: the fastest version and a monstrosity (playground).

pub fn add_padding(input_len: usize, output: &mut [u8]) -> usize {
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;

    match padding_length {
        0 => {
            for byte in output[..padding_length].iter_mut() {
                *byte = b'=';
            }
        }
        1 => {
            for byte in output[..padding_length].iter_mut() {
                *byte = b'=';
            }
        }
        2 => {
            for byte in output[..padding_length].iter_mut() {
                *byte = b'=';
            }
        }
        _ => unreachable!(),
    }

    padding_length
}

absolute:
0: time:  [2.8705 ns 2.8749 ns 2.8797 ns]
1: time:  [3.2446 ns 3.2470 ns 3.2499 ns]
2: time:  [3.4626 ns 3.4753 ns 3.4894 ns]

relative compared to baseline:
0: time:  [-0.1432% +0.0826% +0.3052%]
1: time:  [+0.0403% +0.2527% +0.4629%]
2: time:  [-7.6693% -7.3060% -6.9496%]

It uses a movw when writing two bytes, which explains why this version beats the baseline only in that case.
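
In source terms, the two-byte case ends up equivalent to a single two-byte copy (a rough analogue of the emitted code, not the actual lowering):

// Hand-written counterpart of the 2 => arm: one bounds check,
// then a single 16-bit store of b"==".
fn write_two_eq(output: &mut [u8]) {
    output[..2].copy_from_slice(b"==");
}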

All measurements were taken with criterion.rs and Rust 1.42.0 on an i5-3450. Care was taken to ensure a low-noise environment with reproducible results.

@jonas-schievink added the C-enhancement, I-slow, T-compiler, and A-iterators labels on Apr 12, 2020