Description
#51901 changed loop cloning to scale the "slow" path cloned loop blocks to 1% of the original block weights. This led to a number of benchmark regressions, such as in ludcmp: #52316.
The root cause of these regressions appears to be poor register allocation: reload and register allocator resolution moves introduce a stack load and store for the hottest variable in the hottest inner loop.
In particular, for the ludcmp function, in the post-51901 compiler, we see this inner loop code:
G_M42486_IG40: ; gcrefRegs=00008002 {rcx r15}, byrefRegs=00000200 {r9}, loop=IG40, byref, isz
00007ffb`f15436f0 0004D0 4D8BE7 mov r12, r15
00007ffb`f15436f3 0004D3 8B5C2434 mov ebx, dword ptr [rsp+34H] ; **** extra load
00007ffb`f15436f7 0004D7 4863FB movsxd rdi, ebx
00007ffb`f15436fa 0004DA C4C17B1054FC10 vmovsd xmm2, qword ptr [r12+8*rdi+16]
00007ffb`f1543701 0004E1 488B7CF910 mov rdi, gword ptr [rcx+8*rdi+16]
00007ffb`f1543706 0004E6 3B6F08 cmp ebp, dword ptr [rdi+8]
00007ffb`f1543709 0004E9 0F83CB080000 jae G_M42486_IG98
00007ffb`f154370f 0004EF C4A16B5954EF10 vmulsd xmm2, xmm2, qword ptr [rdi+8*r13+16]
00007ffb`f1543716 0004F6 C4E1735CCA vsubsd xmm1, xmm1, xmm2
00007ffb`f154371b 0004FB FFC3 inc ebx
00007ffb`f154371d 0004FD 3BDD cmp ebx, ebp
00007ffb`f154371f 0004FF 895C2434 mov dword ptr [rsp+34H], ebx ; **** extra store
00007ffb`f1543723 000503 7CCB jl SHORT G_M42486_IG40
On entry to the loop, the hot variable V09 (stored at rsp+34H) gets spilled in a new predecessor block (created by a critical edge split) so its r12 register can be used by another variable. All the registers are in use at this point, and LSRA decides to spill V09, despite the fact that it is the hottest variable and this is the hottest loop (with the highest block weight). Immediately after r12 is defined, V09 is reloaded, now into ebx. At the bottom of the block, V09 needs to be stored back to the stack as a resolution move, since the head of this very block expects it there.
When LSRA goes to spill something for V26 (and thus spills V09), it uses the table of per-register spill costs computed in LinearScan::updateSpillCost(). This uses the interval->recentRefPosition for V09. The recentRefPosition is in a block with a low weight, and since the spill cost is affected by block weight, V09's register ends up with the lowest spill cost and LSRA chooses to spill V09.
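The effect of the weight scaling on this choice can be illustrated with a small model. This is a simplified sketch, not the JIT's actual code: it assumes the spill cost is simply the weight of the block holding the most recent reference, and the variables Vx and Vy and all weights are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    var: str                  # variable currently occupying a register
    recent_ref_weight: float  # weight of the block holding its most recent reference


def pick_spill(candidates):
    """Pick the candidate with the lowest (weight-based) spill cost."""
    return min(candidates, key=lambda c: c.recent_ref_weight)


SLOW_PATH_SCALE = 0.01  # the 1% scaling applied to cloned "slow" path blocks

# Hypothetical competing variables and weights, for illustration only.
others = [Candidate("Vx", 8.0), Candidate("Vy", 16.0)]

# Before scaling: V09's most recent reference sits in a block with full weight.
before = others + [Candidate("V09", 64.0)]
# After scaling: the same block now carries 1% of its original weight.
after = others + [Candidate("V09", 64.0 * SLOW_PATH_SCALE)]

print(pick_spill(before).var)
print(pick_spill(after).var)
```

In this model, V09 is protected before the scaling and becomes the cheapest candidate afterwards, even though nothing about its use in the hot loop has changed.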
If, instead, the FAR_NEXT_REF heuristic were evaluated first (it is not), LSRA would notice that V09 is used immediately after this position and would choose to spill something else.
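A rough sketch of how the ordering of the two criteria changes the outcome is shown below. It is only a model under stated assumptions: the real LSRA applies an ordered list of heuristics, whereas here each criterion is used on its own, and Vx, Vy, the costs, and the next-reference distances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    var: str
    spill_cost: float       # weight-based cost (hypothetical values)
    next_ref_distance: int  # locations until the next reference (hypothetical)


def pick_by_spill_cost(candidates):
    """Model of choosing the cheapest register to spill."""
    return min(candidates, key=lambda c: c.spill_cost)


def pick_by_far_next_ref(candidates):
    """Model of preferring the register whose next reference is farthest away."""
    return max(candidates, key=lambda c: c.next_ref_distance)


candidates = [
    Candidate("V09", spill_cost=0.64, next_ref_distance=1),  # used again right away
    Candidate("Vx", spill_cost=8.0, next_ref_distance=40),
    Candidate("Vy", spill_cost=16.0, next_ref_distance=12),
]

print(pick_by_spill_cost(candidates).var)
print(pick_by_far_next_ref(candidates).var)
```

With these illustrative numbers, the cost-based choice picks V09, while the next-reference-based choice picks a variable that is not needed again for a while.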
Previously, before the block weights were scaled, the cost of spilling V09 was high, so it was not spilled. In the new code, the recentRefPosition is in a "cold path" block with scaled-down weight.