
[RISCV] Move VMV0 elimination past machine SSA opts #126850

Merged
merged 1 commit into llvm:main from the move-vmv0-elimination-past-ssa-opt branch on Feb 20, 2025

Conversation

lukel97
Contributor

@lukel97 lukel97 commented Feb 12, 2025

This is the follow-up to #125026 that keeps mask operands in virtual register form for as long as possible throughout the backend.

The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer kicking in.

The invariant that the mask COPY never has a subreg no longer holds after MachineCSE (it coalesces some copies), so it needed to be relaxed.
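
As a rough illustration (a hypothetical MIR sketch modelled on the register-allocator dump later in this thread, not taken from the patch; the %mask name is invented), the mask of a masked pseudo now stays in a vmv0-class virtual register throughout the machine SSA optimizations:

  %mask:vmv0 = COPY %6:vr
  early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, %mask:vmv0, %13:gprnox0, 6

and only just before register allocation does VMV0 elimination rewrite it to a copy into the physical register:

  $v0 = COPY %6:vr
  early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, $v0, %13:gprnox0, 6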

@llvmbot
Member

llvmbot commented Feb 12, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

This is the follow-up to #125026 that keeps mask operands in virtual register form for as long as possible throughout the backend.

The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer kicking in.

The invariant that the mask COPY never has a subreg no longer holds after MachineCSE (it coalesces some copies), so it needed to be relaxed.


Patch is 172.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126850.diff

35 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+2-6)
  • (modified) llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp (+2-3)
  • (modified) llvm/test/CodeGen/RISCV/O0-pipeline.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/O3-pipeline.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/commutable.ll (+23-46)
  • (modified) llvm/test/CodeGen/RISCV/rvv/copyprop.mir (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-cttz-vp.ll (+56-70)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp.ll (-24)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-trunc-vp.ll (+641-312)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vpload.ll (+17-17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vselect.ll (+16-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/floor-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fmaximum-vp.ll (+3-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fminimum-vp.ll (+3-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fshr-fshl-vp.ll (+23-23)
  • (modified) llvm/test/CodeGen/RISCV/rvv/nearbyint-vp.ll (+27-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/round-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/roundeven-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/roundtozero-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/setcc-fp-vp.ll (+12-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-extract-last-active.ll (+52-44)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-reassociations.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfadd-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfdiv-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfma-vp.ll (+84-94)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmax-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmin-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmul-vp.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfptrunc-vp.ll (+52-57)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfsub-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpgather-sdnode.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpload.ll (+9-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll (+36-32)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vtrunc-vp.ll (+47-36)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 167dbb53c5950..b2b77584c414d 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -588,8 +588,6 @@ void RISCVPassConfig::addPreEmitPass2() {
 
 void RISCVPassConfig::addMachineSSAOptimization() {
   addPass(createRISCVVectorPeepholePass());
-  // TODO: Move this to pre regalloc
-  addPass(createRISCVVMV0EliminationPass());
 
   TargetPassConfig::addMachineSSAOptimization();
 
@@ -602,10 +600,6 @@ void RISCVPassConfig::addMachineSSAOptimization() {
 }
 
 void RISCVPassConfig::addPreRegAlloc() {
-  // TODO: Move this as late as possible before regalloc
-  if (TM->getOptLevel() == CodeGenOptLevel::None)
-    addPass(createRISCVVMV0EliminationPass());
-
   addPass(createRISCVPreRAExpandPseudoPass());
   if (TM->getOptLevel() != CodeGenOptLevel::None) {
     addPass(createRISCVMergeBaseOffsetOptPass());
@@ -619,6 +613,8 @@ void RISCVPassConfig::addPreRegAlloc() {
 
   if (TM->getOptLevel() != CodeGenOptLevel::None && EnableMachinePipeliner)
     addPass(&MachinePipelinerID);
+
+  addPass(createRISCVVMV0EliminationPass());
 }
 
 void RISCVPassConfig::addFastRegAlloc() {
diff --git a/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp b/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
index a2daa19614d0a..508768270f8b0 100644
--- a/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
@@ -131,10 +131,9 @@ bool RISCVVMV0Elimination::runOnMachineFunction(MachineFunction &MF) {
 
           // Peek through a single copy to match what isel does.
           if (MachineInstr *SrcMI = MRI.getVRegDef(Src);
-              SrcMI->isCopy() && SrcMI->getOperand(1).getReg().isVirtual()) {
-            assert(SrcMI->getOperand(1).getSubReg() == RISCV::NoSubRegister);
+              SrcMI->isCopy() && SrcMI->getOperand(1).getReg().isVirtual() &&
+              SrcMI->getOperand(1).getSubReg() == RISCV::NoSubRegister)
             Src = SrcMI->getOperand(1).getReg();
-          }
 
           BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(RISCV::COPY), RISCV::V0)
               .addReg(Src);
diff --git a/llvm/test/CodeGen/RISCV/O0-pipeline.ll b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
index a50c303819f23..f93cb65897210 100644
--- a/llvm/test/CodeGen/RISCV/O0-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
@@ -39,11 +39,11 @@
 ; CHECK-NEXT:       RISC-V DAG->DAG Pattern Instruction Selection
 ; CHECK-NEXT:       Finalize ISel and expand pseudo-instructions
 ; CHECK-NEXT:       Local Stack Slot Allocation
-; CHECK-NEXT:       RISC-V VMV0 Elimination
 ; CHECK-NEXT:       RISC-V Pre-RA pseudo instruction expansion pass
 ; CHECK-NEXT:       RISC-V Insert Read/Write CSR Pass
 ; CHECK-NEXT:       RISC-V Insert Write VXRM Pass
 ; CHECK-NEXT:       RISC-V Landing Pad Setup
+; CHECK-NEXT:       RISC-V VMV0 Elimination
 ; CHECK-NEXT:       Init Undef Pass
 ; CHECK-NEXT:       Eliminate PHI nodes for register allocation
 ; CHECK-NEXT:       Two-Address instruction pass
diff --git a/llvm/test/CodeGen/RISCV/O3-pipeline.ll b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
index 2646dfeca4eb6..a2b5e9c86a107 100644
--- a/llvm/test/CodeGen/RISCV/O3-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
@@ -97,7 +97,6 @@
 ; CHECK-NEXT:       RISC-V DAG->DAG Pattern Instruction Selection
 ; CHECK-NEXT:       Finalize ISel and expand pseudo-instructions
 ; CHECK-NEXT:       RISC-V Vector Peephole Optimization
-; CHECK-NEXT:       RISC-V VMV0 Elimination
 ; CHECK-NEXT:       Lazy Machine Block Frequency Analysis
 ; CHECK-NEXT:       Early Tail Duplication
 ; CHECK-NEXT:       Optimize machine instruction PHIs
@@ -128,6 +127,7 @@
 ; CHECK-NEXT:       RISC-V Insert Read/Write CSR Pass
 ; CHECK-NEXT:       RISC-V Insert Write VXRM Pass
 ; CHECK-NEXT:       RISC-V Landing Pad Setup
+; CHECK-NEXT:       RISC-V VMV0 Elimination
 ; CHECK-NEXT:       Detect Dead Lanes
 ; CHECK-NEXT:       Init Undef Pass
 ; CHECK-NEXT:       Process Implicit Definitions
diff --git a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
index 1b9c78a20ec3b..039266b169ab2 100644
--- a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
 ; CHECK-NEXT:    vmv1r.v v0, v6
 ; CHECK-NEXT:    vsetvli zero, a2, e64, m8, ta, ma
 ; CHECK-NEXT:    vfabs.v v24, v16, v0.t
+; CHECK-NEXT:    addi a2, sp, 16
+; CHECK-NEXT:    vs8r.v v24, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT:    vl8r.v v24, (a2) # Unknown-size Folded Reload
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
 ; CHECK-NEXT:    vmflt.vf v6, v24, fa5, v0.t
 ; CHECK-NEXT:    fsrmi a2, 3
 ; CHECK-NEXT:    vmv1r.v v0, v6
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, ma
 ; CHECK-NEXT:    vfcvt.x.f.v v24, v16, v0.t
-; CHECK-NEXT:    addi a3, sp, 16
-; CHECK-NEXT:    vs8r.v v24, (a3) # Unknown-size Folded Spill
 ; CHECK-NEXT:    fsrm a2
-; CHECK-NEXT:    addi a2, sp, 16
-; CHECK-NEXT:    vl8r.v v24, (a2) # Unknown-size Folded Reload
 ; CHECK-NEXT:    vfcvt.f.x.v v24, v24, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
 ; CHECK-NEXT:    vfsgnj.vv v16, v24, v16, v0.t
-; CHECK-NEXT:    vs8r.v v16, (a2) # Unknown-size Folded Spill
 ; CHECK-NEXT:    bltu a0, a1, .LBB44_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    mv a0, a1
 ; CHECK-NEXT:  .LBB44_2:
 ; CHECK-NEXT:    vmv1r.v v0, v7
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m8, ta, ma
-; CHECK-NEXT:    vfabs.v v16, v8, v0.t
+; CHECK-NEXT:    vfabs.v v24, v8, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
-; CHECK-NEXT:    vmflt.vf v7, v16, fa5, v0.t
+; CHECK-NEXT:    vmflt.vf v7, v24, fa5, v0.t
 ; CHECK-NEXT:    fsrmi a0, 3
 ; CHECK-NEXT:    vmv1r.v v0, v7
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, ma
-; CHECK-NEXT:    vfcvt.x.f.v v16, v8, v0.t
+; CHECK-NEXT:    vfcvt.x.f.v v24, v8, v0.t
 ; CHECK-NEXT:    fsrm a0
-; CHECK-NEXT:    vfcvt.f.x.v v16, v16, v0.t
+; CHECK-NEXT:    vfcvt.f.x.v v24, v24, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
-; CHECK-NEXT:    vfsgnj.vv v8, v16, v8, v0.t
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vfsgnj.vv v8, v24, v8, v0.t
 ; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add sp, sp, a0
diff --git a/llvm/test/CodeGen/RISCV/rvv/commutable.ll b/llvm/test/CodeGen/RISCV/rvv/commutable.ll
index e26c467f025bd..5f35626120178 100644
--- a/llvm/test/CodeGen/RISCV/rvv/commutable.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/commutable.ll
@@ -26,10 +26,9 @@ define <vscale x 1 x i64> @commutable_vadd_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vadd_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vadd.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vadd.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vadd.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -59,10 +58,9 @@ define <vscale x 1 x i64> @commutable_vand_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vand_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vand.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vand.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vand.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vand.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -92,10 +90,9 @@ define <vscale x 1 x i64> @commutable_vor_vv_masked(<vscale x 1 x i64> %0, <vsca
 ; CHECK-LABEL: commutable_vor_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vor.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vor.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -125,10 +122,9 @@ define <vscale x 1 x i64> @commutable_vxor_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vxor_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vxor.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vxor.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vxor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vxor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -158,10 +154,9 @@ define <vscale x 1 x i1> @commutable_vmseq_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmseq_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmseq.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmseq.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e8, mf8, ta, ma
-; CHECK-NEXT:    vmxor.mm v0, v10, v8
+; CHECK-NEXT:    vmxor.mm v0, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i1> @llvm.riscv.vmseq.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2)
   %b = call <vscale x 1 x i1> @llvm.riscv.vmseq.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2)
@@ -191,10 +186,9 @@ define <vscale x 1 x i1> @commutable_vmsne_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmsne_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmsne.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmsne.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e8, mf8, ta, ma
-; CHECK-NEXT:    vmxor.mm v0, v10, v8
+; CHECK-NEXT:    vmxor.mm v0, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i1> @llvm.riscv.vmsne.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2)
   %b = call <vscale x 1 x i1> @llvm.riscv.vmsne.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2)
@@ -224,10 +218,9 @@ define <vscale x 1 x i64> @commutable_vmin_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmin_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmin.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmin.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmin.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmin.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -257,10 +250,9 @@ define <vscale x 1 x i64> @commutable_vminu_vv_masked(<vscale x 1 x i64> %0, <vs
 ; CHECK-LABEL: commutable_vminu_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vminu.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vminu.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vminu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vminu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -290,10 +282,9 @@ define <vscale x 1 x i64> @commutable_vmax_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmax_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmax.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmax.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmax.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmax.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -323,10 +314,9 @@ define <vscale x 1 x i64> @commutable_vmaxu_vv_masked(<vscale x 1 x i64> %0, <vs
 ; CHECK-LABEL: commutable_vmaxu_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmaxu.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmaxu.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmaxu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmaxu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -356,10 +346,9 @@ define <vscale x 1 x i64> @commutable_vmul_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmul_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmul.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmul.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmul.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmul.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -389,10 +378,9 @@ define <vscale x 1 x i64> @commutable_vmulh_vv_masked(<vscale x 1 x i64> %0, <vs
 ; CHECK-LABEL: commutable_vmulh_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmulh.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmulh.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmulh.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmulh.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -422,10 +410,9 @@ define <vscale x 1 x i64> @commutable_vmulhu_vv_masked(<vscale x 1 x i64> %0, <v
 ; CHECK-LABEL: commutable_vmulhu_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmulhu.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmulhu.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmulhu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmulhu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -456,9 +443,8 @@ define <vscale x 1 x i64> @commutable_vwadd_vv_masked(<vscale x 1 x i32> %0, <vs
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e32, mf2, ta, ma
 ; CHECK-NEXT:    vwadd.vv v10, v8, v9, v0.t
-; CHECK-NEXT:    vwadd.vv v11, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v11
+; CHECK-NEXT:    vadd.vv v8, v10, v10
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vwadd.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vwadd.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -489,9 +475,8 @@ define <vscale x 1 x i64> @commutable_vwaddu_vv_masked(<vscale x 1 x i32> %0, <v
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e32, mf2, ta, ma
 ; CHECK-NEXT:    vwaddu.vv v10, v8, v9, v0.t
-; CHECK-NEXT:    vwaddu.vv v11, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v11
+; CHECK-NEXT:    vadd.vv v8, v10, v10
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vwaddu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vwaddu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -522,9 +507,8 @@ define <vscale x 1 x i64> @commutable_vwmul_vv_masked(<vscale x 1 x i32> %0, <vs
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e32, mf2, ta, ma
 ; CHECK-NEXT:    vwmul.vv v10, v8, v9, v0.t
-; CHECK-NEXT:    vwmul.vv v11, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v11
+; CHECK-NEXT:    vadd.vv v8, v10, v10
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vwmul.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vwmul.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -555,9 +539,8 @@ define <vscale x 1 x i64> @commutable_vwmulu_vv_masked(<vscale x 1 x i32> %0, <v
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e32, mf2, ta, ma
 ; CHECK-NEXT:    vwmulu.vv v10, v8, v9, v0.t
-; CHECK-NEXT:    vwmulu.vv v11, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v11
+; CHECK-NEXT:    vadd.vv v8, v10, v10
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vwmulu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscal...
[truncated]

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
Contributor

@wangpc-pp wangpc-pp Feb 12, 2025

Why would we reload the value right after the spill here?

Collaborator

Luke, have you had a chance to look into this? I don't know that this is a blocker, but this one is really weird. It appears to be the only reload from the stack slot - as in, we didn't need to spill anything here at all.

Contributor Author

Yeah it's the same thing as this case here, definitely very weird: #126850 (comment)

My main suspicion is that the physical COPYs to $v0 throw everything off, possibly because they assign a register hint. I had a previous idea to rejig the vmv0 elimination pass to emit COPYs to vmv0 around the uses instead of to $v0, and see if the register allocator likes that better. I'll give it another shot.

FWIW these test cases with the weird spilling are very artificial (LMUL 16 types) and I would pray we never see them in the real world.

Contributor

@wangpc-pp wangpc-pp left a comment

I totally agree we should go further in this direction! I just found some potential opportunities to optimize. They may not be related to this PR; this PR just uncovered them.

; CHECK-NEXT: sub sp, sp, a1
; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
Contributor

This is a regression.

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_round_nxv16f64(<vscale x 16 x double> %va, <vs
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Contributor

Reload right after spill.

Contributor Author

I looked into this but couldn't really figure out what the register allocator was doing. It's spilling %14 here:

288B	  $v0 = COPY %6:vr
304B	  %18:vr = COPY %6:vr
320B	  early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, $v0, %13:gprnox0, 6
336B	  %39:gpr = SwapFRMImm 4, implicit-def $frm, implicit $frm
352B	  $v0 = COPY %18:vr
368B	  %20:vrm8nov0 = nofpexcept PseudoVFCVT_X_F_V_M8_MASK undef %20:vrm8nov0(tied-def 0), %1:vrm8nov0, $v0, 4, %13:gprnox0, 6, 3, implicit $frm

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_roundeven_nxv16f64(<vscale x 16 x double> %va,
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Contributor

Ditto.

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_roundtozero_nxv16f64(<vscale x 16 x double> %v
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Contributor

Ditto.

@@ -429,16 +429,16 @@ define <vscale x 32 x bfloat> @vfadd_vv_nxv32bf16(<vscale x 32 x bfloat> %va, <v
; CHECK-NEXT: sltu a2, a0, a3
; CHECK-NEXT: addi a2, a2, -1
; CHECK-NEXT: and a2, a2, a3
; CHECK-NEXT: vmv4r.v v8, v16
; CHECK-NEXT: vmv8r.v v8, v16
Contributor

This is a dead move since v8 will be overridden at line 447 by vfwcvtbf16.f.f.v v8, v20, v0.t. Somehow we don't remove this move?

@lukel97
Contributor Author

lukel97 commented Feb 18, 2025

Ping

Collaborator

@preames preames left a comment

I'm a bit hesitant about moving forward with this one given the noted weird register allocation behavior in the tests. I see two paths forward:

  1. Performance data which shows this is strongly net positive.
  2. Investigating the regalloc cases, and fixing them in separate patches. In particular, the spill/reload with no need to actually spill seems... weird.

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
Collaborator

Luke, have you had a chance to look into this? I don't know that this is a blocker, but this one is really weird. It appears to be the only reload from the stack slot - as in, we didn't need to spill anything here at all.

@lukel97
Contributor Author

lukel97 commented Feb 19, 2025

I see two paths forward:

  1. Performance data which shows this is strongly net positive.

I haven't been able to get any dynamic data yet, but I got some static results which line up with what I was expecting. There's a significant increase in the number of copies coalesced by MachineCSE:

Metric: machine-cse.NumCoalesces

Program            machine-cse.NumCoalesces              
                   lhs                      rhs     diff 
     538.imagick_r 1266.00                  2137.00 68.8%
     638.imagick_s 1266.00                  2137.00 68.8%
   631.deepsjeng_s  158.00                   230.00 45.6%
   531.deepsjeng_r  158.00                   230.00 45.6%
         519.lbm_r    3.00                     4.00 33.3%
         619.lbm_s    3.00                     4.00 33.3%
      511.povray_r  690.00                   865.00 25.4%
     526.blender_r 7891.00                  8539.00  8.2%
        625.x264_s 1732.00                  1828.00  5.5%
        525.x264_r 1732.00                  1828.00  5.5%
     520.omnetpp_r  614.00                   640.00  4.2%
     620.omnetpp_s  614.00                   640.00  4.2%
         544.nab_r  164.00                   168.00  2.4%
         644.nab_s  164.00                   168.00  2.4%
       641.leela_s  283.00                   289.00  2.1%
       541.leela_r  283.00                   289.00  2.1%
   500.perlbench_r  996.00                  1012.00  1.6%
   600.perlbench_s  996.00                  1012.00  1.6%
         602.gcc_s 5373.00                  5443.00  1.3%
         502.gcc_r 5373.00                  5443.00  1.3%
      510.parest_r 9374.00                  9484.00  1.2%
   623.xalancbmk_s 1861.00                  1873.00  0.6%
   523.xalancbmk_r 1861.00                  1873.00  0.6%
        508.namd_r 2245.00                  2246.00  0.0%
         605.mcf_s   34.00                    34.00  0.0%
          557.xz_r  181.00                   181.00  0.0%
         505.mcf_r   34.00                    34.00  0.0%
          657.xz_s  181.00                   181.00  0.0%
Geomean difference                                  11.5%

povray/imagick/blender have more vector sequences sunk into if/else blocks:

Metric: machine-sink.NumSunk

Program            machine-sink.NumSunk                
                   lhs                  rhs       diff 
      511.povray_r   8836.00              9149.00  3.5%
     538.imagick_r  30541.00             30631.00  0.3%
     638.imagick_s  30541.00             30631.00  0.3%
     526.blender_r 119273.00            119323.00  0.0%
    ...
Geomean difference                                 0.1%

MachineLICM does more hoisting:

Program                                       machinelicm.NumHoisted                machinelicm.NumCSEed               
                                              lhs                    rhs      diff  lhs                  rhs      diff 
FP2017rate/519.lbm_r/519.lbm_r                  275.00                 277.00  0.7%    28.00                28.00  0.0%
FP2017speed/619.lbm_s/619.lbm_s                 280.00                 282.00  0.7%    32.00                32.00  0.0%
FP2017speed/638.imagick_s/638.imagick_s       21459.00               21569.00  0.5% 10250.00             10331.00  0.8%
FP2017rate/538.imagick_r/538.imagick_r        21459.00               21569.00  0.5% 10250.00             10331.00  0.8%
FP2017rate/526.blender_r/526.blender_r        76896.00               76973.00  0.1% 30323.00             30343.00  0.1%
INT2017rate/525.x264_r/525.x264_r              2953.00                2955.00  0.1%  1179.00              1179.00  0.0%
INT2017speed/625.x264_s/625.x264_s             2953.00                2955.00  0.1%  1179.00              1179.00  0.0%
FP2017rate/510.parest_r/510.parest_r          65432.00               65464.00  0.0% 28134.00             28137.00  0.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s    5670.00                5670.00  0.0%  2019.00              2019.00  0.0%
INT2017speed/602.gcc_s/602.gcc_s              61200.00               61200.00  0.0% 26398.00             26398.00  0.0%
INT2017speed/605.mcf_s/605.mcf_s                244.00                 244.00  0.0%   107.00               107.00  0.0%
FP2017rate/508.namd_r/508.namd_r               2455.00                2455.00  0.0%  1384.00              1384.00  0.0%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s    9640.00                9640.00  0.0%  3570.00              3570.00  0.0%
INT2017rate/557.xz_r/557.xz_r                  1034.00                1034.00  0.0%   345.00               345.00  0.0%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s    1305.00                1305.00  0.0%   426.00               426.00  0.0%
                           Geomean difference                                  0.1%                                0.1%

There's negligible impact on spills/reloads:

Program                                       regalloc.NumSpills                regalloc.NumReloads               
                                              lhs                rhs      diff  lhs                 rhs      diff 
FP2017rate/538.imagick_r/538.imagick_r         4124.00            4125.00  0.0% 10418.00            10396.00 -0.2%
FP2017speed/638.imagick_s/638.imagick_s        4124.00            4125.00  0.0% 10418.00            10396.00 -0.2%
FP2017rate/526.blender_r/526.blender_r        13483.00           13484.00  0.0% 27498.00            27499.00  0.0%
FP2017rate/508.namd_r/508.namd_r               6712.00            6712.00  0.0% 16664.00            16664.00  0.0%
INT2017rate/525.x264_r/525.x264_r              2214.00            2214.00  0.0%  4630.00             4630.00  0.0%
INT2017speed/641.leela_s/641.leela_s            319.00             319.00  0.0%   484.00              484.00  0.0%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s     355.00             355.00  0.0%   711.00              711.00  0.0%
INT2017speed/625.x264_s/625.x264_s             2214.00            2214.00  0.0%  4630.00             4630.00  0.0%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s    1816.00            1816.00  0.0%  2982.00             2982.00  0.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s     721.00             721.00  0.0%  1212.00             1212.00  0.0%
INT2017speed/605.mcf_s/605.mcf_s                124.00             124.00  0.0%   373.00              373.00  0.0%
INT2017spe...00.perlbench_s/600.perlbench_s    4389.00            4389.00  0.0%  9753.00             9753.00  0.0%
INT2017rate/557.xz_r/557.xz_r                   317.00             317.00  0.0%   611.00              611.00  0.0%
INT2017rate/541.leela_r/541.leela_r             319.00             319.00  0.0%   484.00              484.00  0.0%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r     355.00             355.00  0.0%   711.00              711.00  0.0%
                           Geomean difference                             -0.0%                              -0.0%

Investigating the regalloc cases, and fixing them in separate patches. In particular, the spill/reload with no need to actually spill seems... weird.

I was able to fix the spills (in this diff) by changing the $V0 COPYs to VMV0 COPYs, but with the big caveat that it required fixing the earlyclobber constraint we have on pseudos, which is needed for the register group overlap rules. At the risk of going down a rabbit hole here I'll attach the patch. I'm not sure if it will be worth pursuing, but it's possible, anyway:
earlyclobber.diff.txt

Contributor

@wangpc-pp wangpc-pp left a comment

LGTM.
I'm inclined to move this forward, as these results seem attractive. :-)

@preames
Collaborator

preames commented Feb 19, 2025

LGTM, remember to rebase.

This is the follow up to llvm#125026 that keeps mask operands in virtual register form for as long as possible throughout the backend.

The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer kicking in.

The invariant that the mask COPY never has a subreg no longer holds after MachineCSE (it coalesces some copies), so it needed to be relaxed.
@lukel97 lukel97 force-pushed the move-vmv0-elimination-past-ssa-opt branch from 1669f34 to 2c41445 on February 20, 2025 04:15
@lukel97 lukel97 merged commit df96b56 into llvm:main Feb 20, 2025
5 of 7 checks passed
@newgre

newgre commented Feb 26, 2025

This change seems to cause an ICE when compiling the following code:

char d[];
int f;
void sigsetjmp();
#define a(b, c) sigsetjmp()
void e() {
  a(, );
  f = 0;
  for (; f < 7; f++) d[f] = f;
}

using this compile script:

#!/bin/bash

clang \
'-mtune=sifive-x280' \
-target \
riscv64-unknown-linux-musl \
'-march=rv64gcv1p0_zfh_zvfh_xsfvcp' \
-w \
-Os \
-c \
reduce_me.c \
-o \
/tmp/reduce_me.o

Could we roll back this change until this is fixed? Please do let me know if you need more info or if you have trouble reproducing.

@lukel97
Contributor Author

lukel97 commented Feb 26, 2025

@newgre Thanks for the minimal reproducer, I think I'll be able to commit a fix for this soon. The issue is to do with dead COPYs being left around. I would have thought that they would have been mopped up by another pass but it looks like when we've moved it this close to regalloc we have to do it ourselves.
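
To illustrate the failure mode (a hypothetical MIR sketch based on the description above and on the earlier dumps in this thread, not the actual MIR from the reproducer; the register numbers and the %mask name are invented): after the pass peeks through the mask COPY and inserts its own copy to $v0, the original vmv0 COPY can be left with no users, e.g.

  %mask:vmv0 = COPY %6:vr
  $v0 = COPY %6:vr
  early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, $v0, %13:gprnox0, 6

The first COPY is now dead, but since no pass between vmv0 elimination and regalloc cleans it up, the register allocator still tries to assign something from the single-register vmv0 class to it.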

@newgre

newgre commented Feb 26, 2025

Thanks for the quick response.
Would you mind rolling back per https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy, as this breaks one of our internal toolchains? Otherwise we will soon be in a situation where crucial build jobs become non-functional, and we'd really like to avoid that.

@preames
Collaborator

preames commented Feb 26, 2025

Thanks for the quick response. Would you mind rolling back per https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy, as this breaks one of our internal toolchains? Otherwise we will soon be in a situation where crucial build jobs become non-functional, and we'd really like to avoid that.

Waiting a week to report a problem and then asking for an immediate revert instead of a prompt fix is not reasonable. Luke, please do revert if the fix turns out to be non-trivial, but it's up to you to make a judgement call here.

lukel97 added a commit that referenced this pull request Feb 26, 2025
This fixes a crash reported at
#126850 (comment),
where we may leave around a COPY to vmv0 after peeking through it.
Even though the COPY is dead, there's no pass between vmv0 elimination
and regalloc that will delete it so regalloc will try to allocate
something for it.

The test showcasing this is added in vmv0-elimination.mir. Removing
the dead COPY results in changes in spills in the >= LMUL 16 VP tests,
but it's worth noting that these tests are very noisy and not
representative of real world code.
@lukel97
Contributor Author

lukel97 commented Feb 26, 2025

Thanks for the quick response. Would you mind rolling back per https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy, as this breaks one of our internal toolchains? Otherwise we will soon be in a situation where crucial build jobs become non-functional, and we'd really like to avoid that.

I've pushed a fix at 35bf925; can you try it out and let me know if that fixes the issue in your internal toolchain?

Given that the fix was pretty small, I wanted to avoid reverting, since the diff in this PR is quite large and has been in tree for nearly a week now. I hope that makes sense.

@newgre

newgre commented Feb 26, 2025

I've rebuilt the compiler and re-ran the test, and with your fix included the ICE goes away. Thanks!

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Feb 26, 2025
This fixes a crash reported at
llvm/llvm-project#126850 (comment),
where we may leave around a COPY to vmv0 after peeking through it.
Even though the COPY is dead, there's no pass between vmv0 elimination
and regalloc that will delete it so regalloc will try to allocate
something for it.

The test showcasing this is added in vmv0-elimination.mir. Removing
the dead COPY results in changes in spills in the >= LMUL 16 VP tests,
but it's worth noting that these tests are very noisy and not
representative of real world code.