
Unnecessary stack usage #88930


Open
newpavlov opened this issue Sep 14, 2021 · 9 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow Issue: Problems and improvements with respect to performance of generated code.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@newpavlov
Contributor

newpavlov commented Sep 14, 2021

I have the following simple SIMD-powered function which tests whether points are inside one of bounding boxes:

use core::arch::x86_64::*;

const N: usize = 8; // illustrative value; in the original Godbolt snippet N is a compile-time constant

pub unsafe fn foo(
    x: &[__m256i; N],
    y: &[__m256i; N],
    z: &[__m256i; N],
    bboxes: &[[__m256i; 6]],
) -> [__m256i; N] {
    let mut res = [_mm256_setzero_si256(); N];
    for bbox in bboxes {
        for i in 0..N {
            let tx = _mm256_and_si256(
                _mm256_cmpgt_epi32(x[i], bbox[0]),
                _mm256_cmpgt_epi32(bbox[1], x[i]),
            );
            let ty = _mm256_and_si256(
                _mm256_cmpgt_epi32(y[i], bbox[2]),
                _mm256_cmpgt_epi32(bbox[3], y[i]),
            );
            let t = _mm256_and_si256(tx, ty);
            let tz = _mm256_and_si256(
                _mm256_cmpgt_epi32(z[i], bbox[4]),
                _mm256_cmpgt_epi32(bbox[5], z[i]),
            );
            let t = _mm256_and_si256(t, tz);
            res[i] = _mm256_or_si256(res[i], t);
        }
    }
    res
}

By inspecting the generated assembly we can see that for some reason it caches the coordinates on the stack and reads them from there on each iteration instead of using the input pointers. The same behavior can be observed for a function which processes coordinate slices. This caching looks quite redundant to me, especially considering that noalias is enabled (i.e. the compiler should know that the memory in which the coordinates are stored cannot change during function execution).

It looks like LLVM correctly hoists the coordinate loads out of the inner loop using its infinite supply of virtual registers. And that is exactly the behavior we want when there are enough physical registers. But when there are not, it spills the virtual register values to the stack instead of re-reading them from their original locations.

On Rust 1.51 the code from the first link does not have this issue, but the code from the second one does.

UPD: See this comment below for an additional example affecting cryptographic code.

@klensy
Contributor

klensy commented Sep 14, 2021

Creating res via MaybeUninit reduces the stack usage, if I understand (and implemented) it correctly.

@newpavlov
Contributor Author

newpavlov commented Sep 14, 2021

You get UB this way. And even before that, it's an incorrect algorithm, unless we use one bbox for its initialization (which would let us save one OR operation at the cost of increased code size, nothing more). res here is essentially a bunch of boolean flags; if bboxes is empty, they all must be "false".
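
For reference, here is a scalar sketch of what foo computes (not from the original Godbolt links, just to make the semantics explicit): res[i] answers "does point i lie strictly inside at least one bbox?", so it has to start out all-false and stay all-false for an empty bboxes slice.

// Scalar equivalent of the accumulation in `foo` (illustrative only).
fn foo_scalar(x: &[i32], y: &[i32], z: &[i32], bboxes: &[[i32; 6]]) -> Vec<bool> {
    let mut res = vec![false; x.len()];
    for bbox in bboxes {
        for i in 0..x.len() {
            let inside = x[i] > bbox[0] && x[i] < bbox[1]
                && y[i] > bbox[2] && y[i] < bbox[3]
                && z[i] > bbox[4] && z[i] < bbox[5];
            res[i] |= inside;
        }
    }
    res
}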

@klensy
Contributor

klensy commented Sep 14, 2021

Why UB? It looks correct, at least.

let mut res: [MaybeUninit<__m256i>; N] = MaybeUninit::uninit_array();
...
res[i].write(_mm256_or_si256(_mm256_setzero_si256(), t));
...
mem::transmute::<_, [__m256i; N]>(res)

@newpavlov
Contributor Author

Ah, I thought you wanted to replace only the first line. But nevertheless, your algorithm (if I understand your proposal correctly) is still incorrect. You replace res[i] = _mm256_or_si256(res[i], t); with res[i].write(_mm256_or_si256(_mm256_setzero_si256(), t));, which removes the accumulation, so only the last bbox determines the result. And it would still be UB for an empty bboxes slice.

If you have corrections, then I think it would be better to post full snippets, preferably on godbolt.

@klensy
Contributor

klensy commented Sep 14, 2021

You are right, I forgot about the loops, my bad.

@nagisa added the A-LLVM and I-slow labels on Sep 15, 2021
@newpavlov
Contributor Author

Can someone please create an LLVM issue for this problem? (registration is currently disabled there)

@piegamesde
Contributor

Another example of a performance issue created by putting too much on the stack: Godbolt link

Basically the line causing issues is let t1 = add($rest, K32X4[$i]);. For every index of the (manually unrolled) loop, K32X4 is accessed exactly once. The generated assembly looks like this:

example::compress:
        addi    sp, sp, -416 # 416 bytes of stack usage!
        … # <skipped function prologue>
        lui     a0, 16 # Two instructions to load the constant from K32
        addi    a0, a0, -256
        sw      a0, 260(sp) # Store it on the stack
        lui     a0, 272547 # Rinse and repeat, for all 64 entries
        addi    a0, a0, -104
        sw      a0, 256(sp)

The function basically copies the entire array onto the stack, although every value is read only once. The compiler should either read the values directly from the array through a reference, or inline each value at its use site (probably faster).

For reference, if you change the line to let t1 = add($rest, *k32($i));, with k32 being a wrapper function that forces runtime evaluation, the stack usage goes down to 176 bytes. Performance stays roughly the same despite the missed optimization opportunities. I'd guess that a correct optimization of the original code could lead to about 20% performance gains here.
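
A minimal sketch of what such a wrapper might look like (the [[u32; 4]; 16] shape of K32X4 and the #[inline(never)] attribute are assumptions here, not the code from the Godbolt link):

// Hypothetical accessor: #[inline(never)] hides the constant from the caller,
// so the table stays in the static's read-only data instead of being
// materialized on the caller's stack.
static K32X4: [[u32; 4]; 16] = [[0; 4]; 16]; // placeholder values

#[inline(never)]
fn k32(i: usize) -> &'static [u32; 4] {
    &K32X4[i]
}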

@piegamesde
Contributor

The issue is still there even when manually unrolling all macro and method invocations. I now have code that looks like this, but repeated a lot of times with different constants:

        let b = [
            a[0].wrapping_add(0x0123),
            a[1].wrapping_add(0x4567),
            a[2].wrapping_add(0x89AB),
            a[3].wrapping_add(0xCDEF),
        ];

And the compiler still puts all of these constants onto the stack first. So it does immediate -> register -> stack -> register, whereas immediate -> register would be all that's needed.
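
To put the fragment in a compilable form for Godbolt (the constants and the function name are placeholders; a single isolated step like this may well compile cleanly, the problem shows up in the full repeated sequence):

// Placeholder step: four immediates feeding four wrapping_adds should not
// need to take a detour through the stack.
pub fn step(a: [u32; 4]) -> [u32; 4] {
    [
        a[0].wrapping_add(0x0123),
        a[1].wrapping_add(0x4567),
        a[2].wrapping_add(0x89AB),
        a[3].wrapping_add(0xCDEF),
    ]
}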

@newpavlov
Contributor Author

It looks like this issue also causes unnecessary leakage of secrets onto the stack in the aes crate.

The following Rust code:

#[no_mangle]
pub unsafe fn aes_enc(cipher: &aes::Aes256, blocks: &mut [aes::Block]) {
    use aes::cipher::BlockEncrypt;
    cipher.encrypt_blocks(blocks);
}

Compiled for an x86-64 target with RUSTFLAGS="-C target-feature=+aes" (to remove the runtime target-feature autodetection, for simplicity), it produces the following assembly:

aes_enc:
	sub	rsp, 120
	mov	eax, edx
	and	eax, 7
	cmp	rdx, 8
	jb	.LBB0_3
	mov	rcx, rdx
	shr	rcx, 3
	movaps	xmm0, xmmword ptr [rdi]
	movaps	xmmword ptr [rsp + 96], xmm0
	movaps	xmm0, xmmword ptr [rdi + 16]
	movaps	xmmword ptr [rsp + 80], xmm0
	movaps	xmm0, xmmword ptr [rdi + 32]
	movaps	xmmword ptr [rsp + 64], xmm0
	movaps	xmm0, xmmword ptr [rdi + 48]
	movaps	xmmword ptr [rsp], xmm0
	movaps	xmm0, xmmword ptr [rdi + 64]
	movaps	xmmword ptr [rsp - 16], xmm0
	movaps	xmm0, xmmword ptr [rdi + 80]
	movaps	xmmword ptr [rsp - 64], xmm0
	movaps	xmm0, xmmword ptr [rdi + 96]
	movaps	xmmword ptr [rsp - 32], xmm0
	movaps	xmm0, xmmword ptr [rdi + 112]
	movaps	xmmword ptr [rsp + 48], xmm0
	movaps	xmm0, xmmword ptr [rdi + 128]
	movaps	xmmword ptr [rsp - 80], xmm0
	movaps	xmm0, xmmword ptr [rdi + 144]
	movaps	xmmword ptr [rsp - 48], xmm0
	movaps	xmm0, xmmword ptr [rdi + 160]
	movaps	xmmword ptr [rsp - 96], xmm0
	movaps	xmm0, xmmword ptr [rdi + 176]
	movaps	xmmword ptr [rsp - 112], xmm0
	movaps	xmm0, xmmword ptr [rdi + 192]
	movaps	xmmword ptr [rsp - 128], xmm0
	movaps	xmm0, xmmword ptr [rdi + 208]
	movaps	xmmword ptr [rsp + 32], xmm0
	movdqa	xmm0, xmmword ptr [rdi + 224]
	movdqa	xmmword ptr [rsp + 16], xmm0
	lea	r8, [rsi + 112]
	movdqa	xmm12, xmmword ptr [rsp + 32]
	movdqa	xmm7, xmmword ptr [rsp + 16]
	.p2align	4, 0x90
.LBB0_2:
	movdqu	xmm5, xmmword ptr [r8 - 112]
	movdqu	xmm6, xmmword ptr [r8 - 96]
	movdqu	xmm3, xmmword ptr [r8 - 80]
	movdqu	xmm4, xmmword ptr [r8 - 64]
	movdqu	xmm1, xmmword ptr [r8 - 48]
	movdqu	xmm2, xmmword ptr [r8 - 32]
	movdqu	xmm15, xmmword ptr [r8 - 16]
	movdqu	xmm8, xmmword ptr [r8]
	movdqa	xmm9, xmmword ptr [rsp + 96]
	pxor	xmm5, xmm9
	pxor	xmm6, xmm9
	movdqa	xmm10, xmmword ptr [rsp + 80]
	aesenc	xmm5, xmm10
	aesenc	xmm6, xmm10
	movdqa	xmm0, xmmword ptr [rsp + 64]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm11, xmm0
	movdqa	xmm0, xmmword ptr [rsp]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 16]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 64]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 32]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm13, xmm0
	movdqa	xmm14, xmmword ptr [rsp + 48]
	aesenc	xmm5, xmm14
	aesenc	xmm6, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 80]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 48]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 96]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 112]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 128]
	aesenc	xmm5, xmm0
	aesenc	xmm6, xmm0
	aesenc	xmm5, xmm12
	aesenc	xmm6, xmm12
	aesenclast	xmm5, xmm7
	aesenclast	xmm6, xmm7
	movdqu	xmmword ptr [r8 - 112], xmm5
	movdqu	xmmword ptr [r8 - 96], xmm6
	movdqa	xmm5, xmm9
	pxor	xmm3, xmm9
	pxor	xmm4, xmm9
	movdqa	xmm6, xmm10
	aesenc	xmm3, xmm10
	aesenc	xmm4, xmm10
	movdqa	xmm9, xmm11
	aesenc	xmm3, xmm11
	aesenc	xmm4, xmm11
	movdqa	xmm10, xmmword ptr [rsp]
	aesenc	xmm3, xmm10
	aesenc	xmm4, xmm10
	movdqa	xmm11, xmmword ptr [rsp - 16]
	aesenc	xmm3, xmm11
	aesenc	xmm4, xmm11
	movdqa	xmm0, xmmword ptr [rsp - 64]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	aesenc	xmm3, xmm13
	aesenc	xmm4, xmm13
	movdqa	xmm13, xmm14
	aesenc	xmm3, xmm14
	aesenc	xmm4, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 80]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	movdqa	xmm14, xmmword ptr [rsp - 48]
	aesenc	xmm3, xmm14
	aesenc	xmm4, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 96]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 112]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 128]
	aesenc	xmm3, xmm0
	aesenc	xmm4, xmm0
	aesenc	xmm3, xmm12
	aesenc	xmm4, xmm12
	aesenclast	xmm3, xmm7
	aesenclast	xmm4, xmm7
	movdqu	xmmword ptr [r8 - 80], xmm3
	movdqu	xmmword ptr [r8 - 64], xmm4
	pxor	xmm1, xmm5
	pxor	xmm2, xmm5
	aesenc	xmm1, xmm6
	aesenc	xmm2, xmm6
	aesenc	xmm1, xmm9
	aesenc	xmm2, xmm9
	aesenc	xmm1, xmm10
	aesenc	xmm2, xmm10
	aesenc	xmm1, xmm11
	aesenc	xmm2, xmm11
	movdqa	xmm3, xmmword ptr [rsp - 64]
	aesenc	xmm1, xmm3
	aesenc	xmm2, xmm3
	movdqa	xmm4, xmmword ptr [rsp - 32]
	aesenc	xmm1, xmm4
	aesenc	xmm2, xmm4
	aesenc	xmm1, xmm13
	aesenc	xmm2, xmm13
	movdqa	xmm0, xmmword ptr [rsp - 80]
	aesenc	xmm1, xmm0
	aesenc	xmm2, xmm0
	aesenc	xmm1, xmm14
	aesenc	xmm2, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 96]
	aesenc	xmm1, xmm0
	aesenc	xmm2, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 112]
	aesenc	xmm1, xmm0
	aesenc	xmm2, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 128]
	aesenc	xmm1, xmm0
	aesenc	xmm2, xmm0
	aesenc	xmm1, xmm12
	aesenc	xmm2, xmm12
	aesenclast	xmm1, xmm7
	aesenclast	xmm2, xmm7
	movdqu	xmmword ptr [r8 - 48], xmm1
	movdqu	xmmword ptr [r8 - 32], xmm2
	pxor	xmm15, xmm5
	pxor	xmm8, xmm5
	aesenc	xmm15, xmm6
	aesenc	xmm8, xmm6
	aesenc	xmm15, xmm9
	aesenc	xmm8, xmm9
	aesenc	xmm15, xmm10
	aesenc	xmm8, xmm10
	aesenc	xmm15, xmm11
	aesenc	xmm8, xmm11
	aesenc	xmm15, xmm3
	aesenc	xmm8, xmm3
	movdqa	xmm1, xmm4
	aesenc	xmm15, xmm4
	aesenc	xmm8, xmm4
	aesenc	xmm15, xmm13
	aesenc	xmm8, xmm13
	movdqa	xmm0, xmmword ptr [rsp - 80]
	aesenc	xmm15, xmm0
	aesenc	xmm8, xmm0
	aesenc	xmm15, xmm14
	aesenc	xmm8, xmm14
	movdqa	xmm0, xmmword ptr [rsp - 96]
	aesenc	xmm15, xmm0
	aesenc	xmm8, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 112]
	aesenc	xmm15, xmm0
	aesenc	xmm8, xmm0
	movdqa	xmm0, xmmword ptr [rsp - 128]
	aesenc	xmm15, xmm0
	aesenc	xmm8, xmm0
	aesenc	xmm15, xmm12
	aesenc	xmm8, xmm12
	aesenclast	xmm15, xmm7
	aesenclast	xmm8, xmm7
	movdqu	xmmword ptr [r8 - 16], xmm15
	movdqu	xmmword ptr [r8], xmm8
	sub	r8, -128
	dec	rcx
	jne	.LBB0_2
.LBB0_3:
	test	rax, rax
	je	.LBB0_11
	movabs	rcx, 1152921504606846968
	and	rdx, rcx
	shl	rdx, 4
	add	rsi, rdx
	movdqa	xmm14, xmmword ptr [rdi]
	movdqa	xmm13, xmmword ptr [rdi + 16]
	movdqa	xmm12, xmmword ptr [rdi + 32]
	movdqa	xmm11, xmmword ptr [rdi + 48]
	movdqa	xmm10, xmmword ptr [rdi + 64]
	movdqa	xmm9, xmmword ptr [rdi + 80]
	movdqa	xmm8, xmmword ptr [rdi + 96]
	movdqa	xmm7, xmmword ptr [rdi + 112]
	movdqa	xmm6, xmmword ptr [rdi + 128]
	movdqa	xmm5, xmmword ptr [rdi + 144]
	movdqa	xmm4, xmmword ptr [rdi + 160]
	movdqa	xmm3, xmmword ptr [rdi + 176]
	movdqa	xmm2, xmmword ptr [rdi + 192]
	movdqa	xmm1, xmmword ptr [rdi + 208]
	movdqa	xmm0, xmmword ptr [rdi + 224]
	movdqu	xmm15, xmmword ptr [rsi]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi], xmm15
	cmp	eax, 1
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 16]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 16], xmm15
	cmp	eax, 2
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 32]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 32], xmm15
	cmp	eax, 3
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 48]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 48], xmm15
	cmp	eax, 4
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 64]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 64], xmm15
	cmp	eax, 5
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 80]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 80], xmm15
	cmp	eax, 6
	je	.LBB0_11
	movdqu	xmm15, xmmword ptr [rsi + 96]
	pxor	xmm15, xmm14
	aesenc	xmm15, xmm13
	aesenc	xmm15, xmm12
	aesenc	xmm15, xmm11
	aesenc	xmm15, xmm10
	aesenc	xmm15, xmm9
	aesenc	xmm15, xmm8
	aesenc	xmm15, xmm7
	aesenc	xmm15, xmm6
	aesenc	xmm15, xmm5
	aesenc	xmm15, xmm4
	aesenc	xmm15, xmm3
	aesenc	xmm15, xmm2
	aesenc	xmm15, xmm1
	aesenclast	xmm15, xmm0
	movdqu	xmmword ptr [rsi + 96], xmm15
.LBB0_11:
	add	rsp, 120
	ret

We can see that the generated code copies the expanded cipher keys onto the stack instead of using them directly through the pointer stored in rdi. The keys are accessed through a shared reference, so the compiler knows they cannot change behind its back.
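
For anyone who wants to poke at this without pulling in the aes crate, a stripped-down sketch along these lines (the function name and signature are made up; this is not the crate's actual implementation) has the relevant shape: 15 expanded AES-256 round keys that are only ever read through a shared reference.

use core::arch::x86_64::*;

// Illustrative only: initial AddRoundKey, 13 aesenc rounds, final aesenclast,
// with the round keys read through a shared reference the whole time.
#[target_feature(enable = "aes")]
pub unsafe fn encrypt_blocks_sketch(round_keys: &[__m128i; 15], blocks: &mut [__m128i]) {
    for block in blocks.iter_mut() {
        let mut b = _mm_xor_si128(*block, round_keys[0]);
        for rk in &round_keys[1..14] {
            b = _mm_aesenc_si128(b, *rk);
        }
        *block = _mm_aesenclast_si128(b, round_keys[14]);
    }
}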

@workingjubilee added the C-optimization label on Feb 14, 2025