Unnecessary stack usage #88930
Creating …
You get UB this way. And even before that, it's an incorrect algorithm, unless we use one …
Why UB? It looks correct, at least.
Ah, I thought you wanted to replace only the first line. But nevertheless, your algorithm (if I understand your proposal correctly) is still incorrect. You replace … If you have corrections, then I think it would be better to post full snippets, preferably on godbolt.
You are right, I forgot about cycles, my bad.
Can someone please create an LLVM issue for this problem? (Registration is currently disabled there.)
Another example of a performance issue created by putting too much on the stack: Godbolt link. Basically, the line causing the issue is in example::compress:

```asm
example::compress:
addi sp, sp, -416 # 416 bytes of stack usage!
… # <skipped function prelude>
lui a0, 16 # Two instructions to load the constant from K32
addi a0, a0, -256
sw a0, 260(sp) # Store it on the stack
lui a0, 272547 # Rinse and repeat, for all 64 entries
addi a0, a0, -104
sw a0, 256(sp)
…
```

The function basically copies the entire array onto the stack, even though every value is read only once. The compiler should either read each value directly from the array, or inline it where used (probably faster). For reference, if you change the line to …
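For context, here is a minimal sketch of the pattern under discussion, assuming a SHA-256-style round-constant table. The reduced table size, the loop body, and the function signature are a hypothetical reconstruction, not the code from the godbolt link:

```rust
// Hypothetical reduced reproduction: a static constant table whose
// entries are each read exactly once inside a compression-style loop.
static K32: [u32; 8] = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
    0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
];

#[no_mangle]
pub fn compress(state: &mut [u32; 8], block: &[u32; 8]) {
    for i in 0..8 {
        // Each constant is used once; the desired codegen reads K32[i]
        // straight from static memory (or inlines it as an immediate)
        // instead of first copying the whole table onto the stack.
        state[i] = state[i]
            .wrapping_add(block[i])
            .wrapping_add(K32[i]);
    }
}
```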
The issue is still there even when manually unrolling all macro and method invocations. I now have code that looks like this, but repeated a lot of times with different constants:

```rust
let b = [
    a[0].wrapping_add(0x0123),
    a[1].wrapping_add(0x4567),
    a[2].wrapping_add(0x89AB),
    a[3].wrapping_add(0xCDEF),
];
```

And the compiler still puts all of these constants onto the stack first. So it does …
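For completeness, a self-contained version of the same pattern that can be pasted into godbolt; the function name and the type of `a` are assumptions:

```rust
#[no_mangle]
pub fn add_round_constants(a: [u32; 4]) -> [u32; 4] {
    // Same shape as the snippet above: four independent wrapping adds
    // with distinct constants, repeated many times in the real code.
    [
        a[0].wrapping_add(0x0123),
        a[1].wrapping_add(0x4567),
        a[2].wrapping_add(0x89AB),
        a[3].wrapping_add(0xCDEF),
    ]
}
```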
It looks like this issue also causes unnecessary leakage of secrets onto the stack in the `aes` crate. The following Rust code:

```rust
#[no_mangle]
pub unsafe fn aes_enc(cipher: &aes::Aes256, blocks: &mut [aes::Block]) {
    use aes::cipher::BlockEncrypt;
    cipher.encrypt_blocks(blocks);
}
```

compiled for x86-64 targets with the AES target feature enabled produces the following assembly:

```asm
aes_enc:
sub rsp, 120
mov eax, edx
and eax, 7
cmp rdx, 8
jb .LBB0_3
mov rcx, rdx
shr rcx, 3
movaps xmm0, xmmword ptr [rdi]
movaps xmmword ptr [rsp + 96], xmm0
movaps xmm0, xmmword ptr [rdi + 16]
movaps xmmword ptr [rsp + 80], xmm0
movaps xmm0, xmmword ptr [rdi + 32]
movaps xmmword ptr [rsp + 64], xmm0
movaps xmm0, xmmword ptr [rdi + 48]
movaps xmmword ptr [rsp], xmm0
movaps xmm0, xmmword ptr [rdi + 64]
movaps xmmword ptr [rsp - 16], xmm0
movaps xmm0, xmmword ptr [rdi + 80]
movaps xmmword ptr [rsp - 64], xmm0
movaps xmm0, xmmword ptr [rdi + 96]
movaps xmmword ptr [rsp - 32], xmm0
movaps xmm0, xmmword ptr [rdi + 112]
movaps xmmword ptr [rsp + 48], xmm0
movaps xmm0, xmmword ptr [rdi + 128]
movaps xmmword ptr [rsp - 80], xmm0
movaps xmm0, xmmword ptr [rdi + 144]
movaps xmmword ptr [rsp - 48], xmm0
movaps xmm0, xmmword ptr [rdi + 160]
movaps xmmword ptr [rsp - 96], xmm0
movaps xmm0, xmmword ptr [rdi + 176]
movaps xmmword ptr [rsp - 112], xmm0
movaps xmm0, xmmword ptr [rdi + 192]
movaps xmmword ptr [rsp - 128], xmm0
movaps xmm0, xmmword ptr [rdi + 208]
movaps xmmword ptr [rsp + 32], xmm0
movdqa xmm0, xmmword ptr [rdi + 224]
movdqa xmmword ptr [rsp + 16], xmm0
lea r8, [rsi + 112]
movdqa xmm12, xmmword ptr [rsp + 32]
movdqa xmm7, xmmword ptr [rsp + 16]
.p2align 4, 0x90
.LBB0_2:
movdqu xmm5, xmmword ptr [r8 - 112]
movdqu xmm6, xmmword ptr [r8 - 96]
movdqu xmm3, xmmword ptr [r8 - 80]
movdqu xmm4, xmmword ptr [r8 - 64]
movdqu xmm1, xmmword ptr [r8 - 48]
movdqu xmm2, xmmword ptr [r8 - 32]
movdqu xmm15, xmmword ptr [r8 - 16]
movdqu xmm8, xmmword ptr [r8]
movdqa xmm9, xmmword ptr [rsp + 96]
pxor xmm5, xmm9
pxor xmm6, xmm9
movdqa xmm10, xmmword ptr [rsp + 80]
aesenc xmm5, xmm10
aesenc xmm6, xmm10
movdqa xmm0, xmmword ptr [rsp + 64]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm11, xmm0
movdqa xmm0, xmmword ptr [rsp]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm0, xmmword ptr [rsp - 16]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm0, xmmword ptr [rsp - 64]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm0, xmmword ptr [rsp - 32]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm13, xmm0
movdqa xmm14, xmmword ptr [rsp + 48]
aesenc xmm5, xmm14
aesenc xmm6, xmm14
movdqa xmm0, xmmword ptr [rsp - 80]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm0, xmmword ptr [rsp - 48]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm0, xmmword ptr [rsp - 96]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm0, xmmword ptr [rsp - 112]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
movdqa xmm0, xmmword ptr [rsp - 128]
aesenc xmm5, xmm0
aesenc xmm6, xmm0
aesenc xmm5, xmm12
aesenc xmm6, xmm12
aesenclast xmm5, xmm7
aesenclast xmm6, xmm7
movdqu xmmword ptr [r8 - 112], xmm5
movdqu xmmword ptr [r8 - 96], xmm6
movdqa xmm5, xmm9
pxor xmm3, xmm9
pxor xmm4, xmm9
movdqa xmm6, xmm10
aesenc xmm3, xmm10
aesenc xmm4, xmm10
movdqa xmm9, xmm11
aesenc xmm3, xmm11
aesenc xmm4, xmm11
movdqa xmm10, xmmword ptr [rsp]
aesenc xmm3, xmm10
aesenc xmm4, xmm10
movdqa xmm11, xmmword ptr [rsp - 16]
aesenc xmm3, xmm11
aesenc xmm4, xmm11
movdqa xmm0, xmmword ptr [rsp - 64]
aesenc xmm3, xmm0
aesenc xmm4, xmm0
aesenc xmm3, xmm13
aesenc xmm4, xmm13
movdqa xmm13, xmm14
aesenc xmm3, xmm14
aesenc xmm4, xmm14
movdqa xmm0, xmmword ptr [rsp - 80]
aesenc xmm3, xmm0
aesenc xmm4, xmm0
movdqa xmm14, xmmword ptr [rsp - 48]
aesenc xmm3, xmm14
aesenc xmm4, xmm14
movdqa xmm0, xmmword ptr [rsp - 96]
aesenc xmm3, xmm0
aesenc xmm4, xmm0
movdqa xmm0, xmmword ptr [rsp - 112]
aesenc xmm3, xmm0
aesenc xmm4, xmm0
movdqa xmm0, xmmword ptr [rsp - 128]
aesenc xmm3, xmm0
aesenc xmm4, xmm0
aesenc xmm3, xmm12
aesenc xmm4, xmm12
aesenclast xmm3, xmm7
aesenclast xmm4, xmm7
movdqu xmmword ptr [r8 - 80], xmm3
movdqu xmmword ptr [r8 - 64], xmm4
pxor xmm1, xmm5
pxor xmm2, xmm5
aesenc xmm1, xmm6
aesenc xmm2, xmm6
aesenc xmm1, xmm9
aesenc xmm2, xmm9
aesenc xmm1, xmm10
aesenc xmm2, xmm10
aesenc xmm1, xmm11
aesenc xmm2, xmm11
movdqa xmm3, xmmword ptr [rsp - 64]
aesenc xmm1, xmm3
aesenc xmm2, xmm3
movdqa xmm4, xmmword ptr [rsp - 32]
aesenc xmm1, xmm4
aesenc xmm2, xmm4
aesenc xmm1, xmm13
aesenc xmm2, xmm13
movdqa xmm0, xmmword ptr [rsp - 80]
aesenc xmm1, xmm0
aesenc xmm2, xmm0
aesenc xmm1, xmm14
aesenc xmm2, xmm14
movdqa xmm0, xmmword ptr [rsp - 96]
aesenc xmm1, xmm0
aesenc xmm2, xmm0
movdqa xmm0, xmmword ptr [rsp - 112]
aesenc xmm1, xmm0
aesenc xmm2, xmm0
movdqa xmm0, xmmword ptr [rsp - 128]
aesenc xmm1, xmm0
aesenc xmm2, xmm0
aesenc xmm1, xmm12
aesenc xmm2, xmm12
aesenclast xmm1, xmm7
aesenclast xmm2, xmm7
movdqu xmmword ptr [r8 - 48], xmm1
movdqu xmmword ptr [r8 - 32], xmm2
pxor xmm15, xmm5
pxor xmm8, xmm5
aesenc xmm15, xmm6
aesenc xmm8, xmm6
aesenc xmm15, xmm9
aesenc xmm8, xmm9
aesenc xmm15, xmm10
aesenc xmm8, xmm10
aesenc xmm15, xmm11
aesenc xmm8, xmm11
aesenc xmm15, xmm3
aesenc xmm8, xmm3
movdqa xmm1, xmm4
aesenc xmm15, xmm4
aesenc xmm8, xmm4
aesenc xmm15, xmm13
aesenc xmm8, xmm13
movdqa xmm0, xmmword ptr [rsp - 80]
aesenc xmm15, xmm0
aesenc xmm8, xmm0
aesenc xmm15, xmm14
aesenc xmm8, xmm14
movdqa xmm0, xmmword ptr [rsp - 96]
aesenc xmm15, xmm0
aesenc xmm8, xmm0
movdqa xmm0, xmmword ptr [rsp - 112]
aesenc xmm15, xmm0
aesenc xmm8, xmm0
movdqa xmm0, xmmword ptr [rsp - 128]
aesenc xmm15, xmm0
aesenc xmm8, xmm0
aesenc xmm15, xmm12
aesenc xmm8, xmm12
aesenclast xmm15, xmm7
aesenclast xmm8, xmm7
movdqu xmmword ptr [r8 - 16], xmm15
movdqu xmmword ptr [r8], xmm8
sub r8, -128
dec rcx
jne .LBB0_2
.LBB0_3:
test rax, rax
je .LBB0_11
movabs rcx, 1152921504606846968
and rdx, rcx
shl rdx, 4
add rsi, rdx
movdqa xmm14, xmmword ptr [rdi]
movdqa xmm13, xmmword ptr [rdi + 16]
movdqa xmm12, xmmword ptr [rdi + 32]
movdqa xmm11, xmmword ptr [rdi + 48]
movdqa xmm10, xmmword ptr [rdi + 64]
movdqa xmm9, xmmword ptr [rdi + 80]
movdqa xmm8, xmmword ptr [rdi + 96]
movdqa xmm7, xmmword ptr [rdi + 112]
movdqa xmm6, xmmword ptr [rdi + 128]
movdqa xmm5, xmmword ptr [rdi + 144]
movdqa xmm4, xmmword ptr [rdi + 160]
movdqa xmm3, xmmword ptr [rdi + 176]
movdqa xmm2, xmmword ptr [rdi + 192]
movdqa xmm1, xmmword ptr [rdi + 208]
movdqa xmm0, xmmword ptr [rdi + 224]
movdqu xmm15, xmmword ptr [rsi]
pxor xmm15, xmm14
aesenc xmm15, xmm13
aesenc xmm15, xmm12
aesenc xmm15, xmm11
aesenc xmm15, xmm10
aesenc xmm15, xmm9
aesenc xmm15, xmm8
aesenc xmm15, xmm7
aesenc xmm15, xmm6
aesenc xmm15, xmm5
aesenc xmm15, xmm4
aesenc xmm15, xmm3
aesenc xmm15, xmm2
aesenc xmm15, xmm1
aesenclast xmm15, xmm0
movdqu xmmword ptr [rsi], xmm15
cmp eax, 1
je .LBB0_11
movdqu xmm15, xmmword ptr [rsi + 16]
pxor xmm15, xmm14
aesenc xmm15, xmm13
aesenc xmm15, xmm12
aesenc xmm15, xmm11
aesenc xmm15, xmm10
aesenc xmm15, xmm9
aesenc xmm15, xmm8
aesenc xmm15, xmm7
aesenc xmm15, xmm6
aesenc xmm15, xmm5
aesenc xmm15, xmm4
aesenc xmm15, xmm3
aesenc xmm15, xmm2
aesenc xmm15, xmm1
aesenclast xmm15, xmm0
movdqu xmmword ptr [rsi + 16], xmm15
cmp eax, 2
je .LBB0_11
movdqu xmm15, xmmword ptr [rsi + 32]
pxor xmm15, xmm14
aesenc xmm15, xmm13
aesenc xmm15, xmm12
aesenc xmm15, xmm11
aesenc xmm15, xmm10
aesenc xmm15, xmm9
aesenc xmm15, xmm8
aesenc xmm15, xmm7
aesenc xmm15, xmm6
aesenc xmm15, xmm5
aesenc xmm15, xmm4
aesenc xmm15, xmm3
aesenc xmm15, xmm2
aesenc xmm15, xmm1
aesenclast xmm15, xmm0
movdqu xmmword ptr [rsi + 32], xmm15
cmp eax, 3
je .LBB0_11
movdqu xmm15, xmmword ptr [rsi + 48]
pxor xmm15, xmm14
aesenc xmm15, xmm13
aesenc xmm15, xmm12
aesenc xmm15, xmm11
aesenc xmm15, xmm10
aesenc xmm15, xmm9
aesenc xmm15, xmm8
aesenc xmm15, xmm7
aesenc xmm15, xmm6
aesenc xmm15, xmm5
aesenc xmm15, xmm4
aesenc xmm15, xmm3
aesenc xmm15, xmm2
aesenc xmm15, xmm1
aesenclast xmm15, xmm0
movdqu xmmword ptr [rsi + 48], xmm15
cmp eax, 4
je .LBB0_11
movdqu xmm15, xmmword ptr [rsi + 64]
pxor xmm15, xmm14
aesenc xmm15, xmm13
aesenc xmm15, xmm12
aesenc xmm15, xmm11
aesenc xmm15, xmm10
aesenc xmm15, xmm9
aesenc xmm15, xmm8
aesenc xmm15, xmm7
aesenc xmm15, xmm6
aesenc xmm15, xmm5
aesenc xmm15, xmm4
aesenc xmm15, xmm3
aesenc xmm15, xmm2
aesenc xmm15, xmm1
aesenclast xmm15, xmm0
movdqu xmmword ptr [rsi + 64], xmm15
cmp eax, 5
je .LBB0_11
movdqu xmm15, xmmword ptr [rsi + 80]
pxor xmm15, xmm14
aesenc xmm15, xmm13
aesenc xmm15, xmm12
aesenc xmm15, xmm11
aesenc xmm15, xmm10
aesenc xmm15, xmm9
aesenc xmm15, xmm8
aesenc xmm15, xmm7
aesenc xmm15, xmm6
aesenc xmm15, xmm5
aesenc xmm15, xmm4
aesenc xmm15, xmm3
aesenc xmm15, xmm2
aesenc xmm15, xmm1
aesenclast xmm15, xmm0
movdqu xmmword ptr [rsi + 80], xmm15
cmp eax, 6
je .LBB0_11
movdqu xmm15, xmmword ptr [rsi + 96]
pxor xmm15, xmm14
aesenc xmm15, xmm13
aesenc xmm15, xmm12
aesenc xmm15, xmm11
aesenc xmm15, xmm10
aesenc xmm15, xmm9
aesenc xmm15, xmm8
aesenc xmm15, xmm7
aesenc xmm15, xmm6
aesenc xmm15, xmm5
aesenc xmm15, xmm4
aesenc xmm15, xmm3
aesenc xmm15, xmm2
aesenc xmm15, xmm1
aesenclast xmm15, xmm0
movdqu xmmword ptr [rsi + 96], xmm15
.LBB0_11:
add rsp, 120
ret
```

We can see that the generated code copies the expanded cipher keys onto the stack instead of using them directly through the pointer stored in `rdi`.
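As a reduced illustration of the access pattern at issue (hypothetical code, not from the `aes` crate): the round keys live behind a shared reference, yet the assembly above shows the compiler copying them onto the stack before the main loop, which both wastes cycles and leaves key material on the stack.

```rust
// Hypothetical reduction of the pattern: round keys are only ever read
// through `rk`, so ideally no stack copy of them should be made.
pub struct RoundKeys {
    keys: [[u8; 16]; 15], // AES-256 has 15 round keys
}

#[no_mangle]
pub fn xor_rounds(rk: &RoundKeys, block: &mut [u8; 16]) {
    for key in &rk.keys {
        for (b, k) in block.iter_mut().zip(key.iter()) {
            *b ^= *k; // stand-in for `aesenc`; only the access pattern matters
        }
    }
}
```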
I have the following simple SIMD-powered function, which tests whether points are inside one of several bounding boxes:
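(The original function was only provided via a godbolt link; as a stand-in, here is a hypothetical sketch of the kind of function being described. All names and the exact box layout are assumptions.)

```rust
/// Hypothetical stand-in: for each point, test whether it lies inside
/// any of a fixed set of axis-aligned bounding boxes.
pub fn points_in_boxes(
    xs: &[f32],
    ys: &[f32],
    boxes: &[[f32; 4]; 8], // [min_x, min_y, max_x, max_y] per box
    out: &mut [bool],
) {
    for ((&x, &y), o) in xs.iter().zip(ys.iter()).zip(out.iter_mut()) {
        // The box coordinates are loop-invariant here; the complaint is
        // that once hoisted out of this loop they get spilled to the
        // stack instead of being re-read from `boxes` when physical
        // registers run out.
        *o = boxes
            .iter()
            .any(|b| x >= b[0] && x <= b[2] && y >= b[1] && y <= b[3]);
    }
}
```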
By inspecting the generated assembly we can see that, for some reason, it caches the coordinates on the stack and reads them from there on each iteration instead of using the input pointers. The same behavior can be observed for a function which processes coordinate slices. This caching looks quite redundant to me, especially considering that `noalias` is enabled (i.e. the compiler should know that the memory in which the coordinates are stored cannot change during function execution).

It looks like LLVM correctly hoists the coordinate loads out of the inner loop using its infinite supply of virtual registers, and that's exactly the behavior we want when there are enough physical registers. But when there are not, it spills the virtual register values to the stack instead of falling back to the original memory locations.
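A minimal, hypothetical way to reproduce the mechanism described above (not the original code) is to keep more loop-invariant values live than there are vector registers and inspect where the spills go, e.g. with `rustc -O --emit=asm`:

```rust
/// Hypothetical register-pressure demo: all 32 constants are
/// loop-invariant, so LLVM hoists their loads out of the hot loop into
/// virtual registers. x86-64 has only 16 XMM registers, so some values
/// must be spilled; the complaint in this issue is that they go to fresh
/// stack slots instead of being re-read from `consts`, which is
/// immutable and `noalias` for the whole call.
#[no_mangle]
pub fn accumulate(consts: &[f32; 32], xs: &mut [f32]) {
    for x in xs.iter_mut() {
        for &c in consts.iter() {
            *x = x.mul_add(c, 1.0);
        }
    }
}
```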
On Rust 1.51 the code from the first link does not have this issue, but the code from the second link does.
UPD: See this comment for an additional example affecting cryptographic code.