LSRA spill cost issue #53703

https://github.com/dotnet/runtime/pull/51901 changed loop cloning to scale the "slow" path cloned loop blocks to 1% of the original block weights. This led to a number of benchmark regressions, such as in ludcmp: https://github.com/dotnet/runtime/issues/52316.

The root cause of these regressions appears to be poor register allocation, which introduces stack loads and stores (from reloads and register allocation resolution moves) for the hottest variable in the hottest inner loop.

In particular, for the `ludcmp` function, in the post-51901 compiler, we see this inner loop code:
```
G_M42486_IG40:        ; gcrefRegs=00008002 {rcx r15}, byrefRegs=00000200 {r9}, loop=IG40, byref, isz
 00007ffb`f15436f0 0004D0 4D8BE7               mov      r12, r15
 00007ffb`f15436f3 0004D3 8B5C2434             mov      ebx, dword ptr [rsp+34H] ; **** extra load
 00007ffb`f15436f7 0004D7 4863FB               movsxd   rdi, ebx
 00007ffb`f15436fa 0004DA C4C17B1054FC10       vmovsd   xmm2, qword ptr [r12+8*rdi+16]
 00007ffb`f1543701 0004E1 488B7CF910           mov      rdi, gword ptr [rcx+8*rdi+16]
 00007ffb`f1543706 0004E6 3B6F08               cmp      ebp, dword ptr [rdi+8]
 00007ffb`f1543709 0004E9 0F83CB080000         jae      G_M42486_IG98
 00007ffb`f154370f 0004EF C4A16B5954EF10       vmulsd   xmm2, xmm2, qword ptr [rdi+8*r13+16]
 00007ffb`f1543716 0004F6 C4E1735CCA           vsubsd   xmm1, xmm1, xmm2
 00007ffb`f154371b 0004FB FFC3                 inc      ebx
 00007ffb`f154371d 0004FD 3BDD                 cmp      ebx, ebp
 00007ffb`f154371f 0004FF 895C2434             mov      dword ptr [rsp+34H], ebx ; **** extra store
 00007ffb`f1543723 000503 7CCB                 jl       SHORT G_M42486_IG40
```

On entry to the loop, the hot variable V09 (stored at `rsp+34H`) gets spilled in a new predecessor block (created by a critical edge split) so that its `r12` register can be used by another variable. All the registers are in use at this point, and LSRA decides to spill V09 even though it is the hottest variable and this is the hottest loop (the one with the highest block weight). Immediately after `r12` is defined, V09 is reloaded, now into `ebx`. At the bottom of the block, V09 must be stored back to the stack as a resolution move, since the head of this very block expects it there.

When LSRA goes to spill something for V26 (and thus spills V09), it uses the table of per-register spill costs computed in `LinearScan::updateSpillCost()`, which uses the `interval->recentRefPosition` for V09. That `recentRefPosition` is in a block with a low weight, and because the spill cost depends on block weight, V09's register looks like the cheapest one to spill, so LSRA chooses it.

If, instead, `FAR_NEXT_REF` were evaluated first (it is not), LSRA would have noticed that V09 is used immediately after this position and would have chosen to spill something else.
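To illustrate the alternative ordering, here is a minimal sketch of a selection that looks at next-reference distance before weight. It is a toy model, not the JIT's code: `SpillCandidate`, `pickSpillFarNextRefFirst`, the variable name `V_other`, and all numbers are hypothetical, and the real LSRA heuristics are more involved.

```cpp
#include <cstddef>

// Hypothetical stand-in for a register's occupying interval (not the JIT's type).
struct SpillCandidate
{
    const char* name;            // illustrative variable name
    double      spillWeight;     // weight-based spill cost (lower looks cheaper)
    unsigned    nextRefDistance; // distance to the next reference (larger is farther)
};

// Assumed ordering: prefer the candidate whose next reference is farthest away,
// and only use the weight-based cost to break ties.
constexpr std::size_t pickSpillFarNextRefFirst(const SpillCandidate* c, std::size_t count)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < count; i++)
    {
        const bool farther = c[i].nextRefDistance > c[best].nextRefDistance;
        const bool tie     = c[i].nextRefDistance == c[best].nextRefDistance;
        if (farther || (tie && (c[i].spillWeight < c[best].spillWeight)))
        {
            best = i;
        }
    }
    return best;
}

// Made-up loop-entry state: V09 has a low weight-based cost but is used right away.
constexpr SpillCandidate kLoopEntry[] = {
    {"V09", 1.0, 1},
    {"V_other", 50.0, 40},
};
```

With these made-up values, `pickSpillFarNextRefFirst(kLoopEntry, 2)` selects `V_other`, because V09's next reference is immediate even though its weight-based cost is the lowest.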

Previously, before the block weights were scaled, the cost of spilling V09 was high, so it was not spilled. In the new code, the `recentRefPosition` is in a "cold path" block with scaled-down weight.
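The effect of the weight scaling on a weight-based spill choice can be sketched as follows. This is a simplified model under stated assumptions, not what `LinearScan::updateSpillCost()` actually computes: `IntervalModel`, `pickSpillByRecentRefWeight`, `kSlowPathScale`, `V_other`, and the weights are all hypothetical and chosen only to show the flip.

```cpp
#include <cstddef>

// Hypothetical stand-in for an LSRA interval (not the JIT's real type).
struct IntervalModel
{
    const char* name;            // illustrative variable name
    double      recentRefWeight; // weight of the block holding the most recent reference
};

// Assumed 1% scaling applied to the cloned "slow" path blocks.
constexpr double kSlowPathScale = 0.01;

// Weight-only policy: the register whose interval was most recently referenced
// in the lowest-weight block is treated as the cheapest to spill.
constexpr std::size_t pickSpillByRecentRefWeight(const IntervalModel* c, std::size_t count)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < count; i++)
    {
        if (c[i].recentRefWeight < c[best].recentRefWeight)
        {
            best = i;
        }
    }
    return best;
}

// Made-up weights. Before scaling, V09's most recent reference sits in a block
// with full weight; after scaling, that same block carries 1% of the weight.
constexpr IntervalModel kBeforeScaling[] = {
    {"V09", 100.0},
    {"V_other", 50.0},
};
constexpr IntervalModel kAfterScaling[] = {
    {"V09", 100.0 * kSlowPathScale},
    {"V_other", 50.0},
};
```

In this model, `pickSpillByRecentRefWeight(kBeforeScaling, 2)` picks `V_other`, while `pickSpillByRecentRefWeight(kAfterScaling, 2)` picks V09: nothing about V09's use in the hot loop changed, only the weight of the block containing its most recent reference.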
