
Filling short slices is slow even if they are provably short #71066

Open
LingMan opened this issue Apr 12, 2020 · 0 comments
Labels
A-iterators (Area: Iterators)
C-enhancement (Category: An issue proposing an enhancement or a PR with one)
I-slow (Issue: Problems and improvements with respect to performance of generated code)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments


LingMan commented Apr 12, 2020

Take this method as a baseline (playground):

pub fn add_padding(input_len: usize, output: &mut [u8]) -> usize {
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;
    for i in 0..padding_length {
        output[i] = b'=';
    }

    padding_length
}

padding_length can take on values in the range [0, 2], so we have to write either zero, one, or two bytes into our slice.
Benchmarking all three cases gives us these timings:

0: time:   [2.8673 ns 2.8692 ns 2.8714 ns]
1: time:   [3.2384 ns 3.2411 ns 3.2443 ns]
2: time:   [3.7454 ns 3.7478 ns 3.7507 ns]
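
For reference, a minimal criterion harness along these lines measures the three cases (a sketch, not the exact setup used; it assumes add_padding is in scope):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_padding(c: &mut Criterion) {
    // input_len of 3, 2, and 1 yields padding_length 0, 1, and 2 respectively
    for &(name, input_len) in &[("pad 0", 3usize), ("pad 1", 2), ("pad 2", 1)] {
        c.bench_function(name, |b| {
            let mut output = [0u8; 3];
            b.iter(|| add_padding(black_box(input_len), black_box(&mut output)));
        });
    }
}

criterion_group!(benches, bench_padding);
criterion_main!(benches);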

In pursuit of more idiomatic code, we switch the loop over to iterators (playground):

pub fn add_padding(input_len: usize, output: &mut [u8]) -> usize {
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;
    for byte in output[..padding_length].iter_mut() {
        *byte = b'=';
    }

    padding_length
}

Given that this loop runs at most two iterations, there is little performance to gain from avoiding bounds checks, so we expect roughly the same runtime.

absolute:
0: time:   [3.2053 ns 3.2074 ns 3.2105 ns]
1: time:   [5.4453 ns 5.4475 ns 5.4501 ns]
2: time:   [6.0211 ns 6.0254 ns 6.0302 ns]

relative compared to baseline:
0: time:   [+11.561% +11.799% +12.030%]
1: time:   [+67.946% +68.287% +68.647%]
2: time:   [+60.241% +60.600% +60.932%]

Oof, up to 68% slower...

Let's see what's happening.

The baseline version copies one byte at a time, which isn't that bad when you copy at most two bytes:

playground::add_padding:
	pushq	%rax
	movq	%rdx, %rcx
	movabsq	$-6148914691236517205, %r8
	movq	%rdi, %rax
	mulq	%r8
	shrq	%rdx
	leaq	(%rdx,%rdx,2), %rax
	subq	%rax, %rdi
	xorq	$3, %rdi
	movq	%rdi, %rax
	mulq	%r8
	shrq	%rdx
	leaq	(%rdx,%rdx,2), %rax
	subq	%rax, %rdi
	je	.LBB0_4
	xorl	%eax, %eax

.LBB0_2:
	cmpq	%rax, %rcx
	je	.LBB0_5
	movb	$61, (%rsi,%rax)
	addq	$1, %rax
	cmpq	%rdi, %rax
	jb	.LBB0_2

.LBB0_4:
	movq	%rdi, %rax
	popq	%rcx
	retq

[snip panic code]

In comparison, the iterator version calls out to a full memset:

playground::add_padding:
	pushq	%rbx
	movq	%rdx, %rcx
	movq	%rdi, %rbx
	movabsq	$-6148914691236517205, %rdi
	movq	%rbx, %rax
	mulq	%rdi
	shrq	%rdx
	leaq	(%rdx,%rdx,2), %rax
	subq	%rax, %rbx
	xorq	$3, %rbx
	movq	%rbx, %rax
	mulq	%rdi
	shrq	%rdx
	leaq	(%rdx,%rdx,2), %rax
	subq	%rax, %rbx
	cmpq	%rcx, %rbx
	ja	.LBB0_4
	testq	%rbx, %rbx
	je	.LBB0_3
	movq	%rsi, %rdi
	movl	$61, %esi
	movq	%rbx, %rdx
	callq	*memset@GOTPCREL(%rip)

.LBB0_3:
	movq	%rbx, %rax
	popq	%rbx
	retq

[snip panic code]

For long slices a memset would be a good choice, but for just a few bytes the overhead is simply too big. When testing with a constant range, we see the compiler emitting different combinations of movb, movw, movl, movabsq, and movaps+movups up to a length of 256 bytes. Only for slices longer than that does it use a memset.
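
For instance, a fixed-length variant like the following compiles to inline stores rather than a memset call (an illustrative test, not code from the original crate):

// With a constant range the fill length is known at compile time, so the
// compiler emits a single two-byte store instead of calling memset.
pub fn fill_two(output: &mut [u8]) {
    for byte in output[..2].iter_mut() {
        *byte = b'=';
    }
}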

At some point the compiler already realizes that padding_length is always < 3, since an assert!(padding_length < 3); gets optimized out completely. Whether this information is unavailable in the right place or simply goes unused, I can't tell.
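
Concretely, adding the assert directly before the loop changes nothing in the generated code:

pub fn add_padding(input_len: usize, output: &mut [u8]) -> usize {
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;
    // This assert is optimized out entirely, so the bound is known here...
    assert!(padding_length < 3);
    // ...yet the loop below is still lowered to a memset call.
    for byte in output[..padding_length].iter_mut() {
        *byte = b'=';
    }

    padding_length
}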

Wrapping the iterator version's loop in a match results in two things: the fastest version and a monstrosity (playground).

pub fn add_padding(input_len: usize, output: &mut [u8]) -> usize {
    let rem = input_len % 3;
    let padding_length = (3 - rem) % 3;

    match padding_length {
        0 => {
            for byte in output[..padding_length].iter_mut() {
                *byte = b'=';
            }
        }
        1 => {
            for byte in output[..padding_length].iter_mut() {
                *byte = b'=';
            }
        }
        2 => {
            for byte in output[..padding_length].iter_mut() {
                *byte = b'=';
            }
        }
        _ => unreachable!(),
    }

    padding_length
}

absolute:
0: time:  [2.8705 ns 2.8749 ns 2.8797 ns]
1: time:  [3.2446 ns 3.2470 ns 3.2499 ns]
2: time:  [3.4626 ns 3.4753 ns 3.4894 ns]

relative compared to baseline:
0: time:  [-0.1432% +0.0826% +0.3052%]
1: time:  [+0.0403% +0.2527% +0.4629%]
2: time:  [-7.6693% -7.3060% -6.9496%]

It uses a movw when writing two bytes, which explains why this version beats the baseline only in that case.
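
In source terms, the two-byte case ends up equivalent to a single two-byte copy (a rough analogue of the emitted code, not the actual lowering):

// Hand-written counterpart of the 2 => arm: one bounds check,
// then a single 16-bit store of b"==".
fn write_two_eq(output: &mut [u8]) {
    output[..2].copy_from_slice(b"==");
}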

All measurements were taken with criterion.rs and Rust 1.42.0 on an i5-3450. Care was taken to ensure a low-noise environment with reproducible results.

@jonas-schievink added the C-enhancement, I-slow, T-compiler, and A-iterators labels on Apr 12, 2020