
[RISCV] Move VMV0 elimination past machine SSA opts #126850

Merged
merged 1 commit into llvm:main from the move-vmv0-elimination-past-ssa-opt branch on Feb 20, 2025

Conversation

lukel97
Contributor

@lukel97 lukel97 commented Feb 12, 2025

This is the follow-up to #125026 that keeps mask operands in virtual register form for as long as possible throughout the backend.

The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer kicking in.

The invariant that the mask COPY never has a subreg no longer holds after MachineCSE (it coalesces some copies), so it needed to be relaxed.
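
As a rough illustration (a hypothetical MIR sketch modelled on the register-allocator dump later in this thread, not taken from the patch; the %mask name is invented), the mask of a masked pseudo now stays in a vmv0-class virtual register throughout the machine SSA optimizations:

  %mask:vmv0 = COPY %6:vr
  early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, %mask:vmv0, %13:gprnox0, 6

and only just before register allocation does VMV0 elimination rewrite it to a copy into the physical register:

  $v0 = COPY %6:vr
  early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, $v0, %13:gprnox0, 6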

@llvmbot
Member

llvmbot commented Feb 12, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

This is the follow-up to #125026 that keeps mask operands in virtual register form for as long as possible throughout the backend.

The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer kicking in.

The invariant that the mask COPY never has a subreg no longer holds after MachineCSE (it coalesces some copies), so it needed to be relaxed.


Patch is 172.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126850.diff

35 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+2-6)
  • (modified) llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp (+2-3)
  • (modified) llvm/test/CodeGen/RISCV/O0-pipeline.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/O3-pipeline.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/commutable.ll (+23-46)
  • (modified) llvm/test/CodeGen/RISCV/rvv/copyprop.mir (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-cttz-vp.ll (+56-70)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp.ll (-24)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-trunc-vp.ll (+641-312)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vpload.ll (+17-17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vselect.ll (+16-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/floor-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fmaximum-vp.ll (+3-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fminimum-vp.ll (+3-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fshr-fshl-vp.ll (+23-23)
  • (modified) llvm/test/CodeGen/RISCV/rvv/nearbyint-vp.ll (+27-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/round-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/roundeven-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/roundtozero-vp.ll (+8-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/setcc-fp-vp.ll (+12-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-extract-last-active.ll (+52-44)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-reassociations.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfadd-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfdiv-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfma-vp.ll (+84-94)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmax-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmin-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmul-vp.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfptrunc-vp.ll (+52-57)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfsub-vp.ll (+8-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpgather-sdnode.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpload.ll (+9-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vreductions-fp-vp.ll (+36-32)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vtrunc-vp.ll (+47-36)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 167dbb53c5950..b2b77584c414d 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -588,8 +588,6 @@ void RISCVPassConfig::addPreEmitPass2() {
 
 void RISCVPassConfig::addMachineSSAOptimization() {
   addPass(createRISCVVectorPeepholePass());
-  // TODO: Move this to pre regalloc
-  addPass(createRISCVVMV0EliminationPass());
 
   TargetPassConfig::addMachineSSAOptimization();
 
@@ -602,10 +600,6 @@ void RISCVPassConfig::addMachineSSAOptimization() {
 }
 
 void RISCVPassConfig::addPreRegAlloc() {
-  // TODO: Move this as late as possible before regalloc
-  if (TM->getOptLevel() == CodeGenOptLevel::None)
-    addPass(createRISCVVMV0EliminationPass());
-
   addPass(createRISCVPreRAExpandPseudoPass());
   if (TM->getOptLevel() != CodeGenOptLevel::None) {
     addPass(createRISCVMergeBaseOffsetOptPass());
@@ -619,6 +613,8 @@ void RISCVPassConfig::addPreRegAlloc() {
 
   if (TM->getOptLevel() != CodeGenOptLevel::None && EnableMachinePipeliner)
     addPass(&MachinePipelinerID);
+
+  addPass(createRISCVVMV0EliminationPass());
 }
 
 void RISCVPassConfig::addFastRegAlloc() {
diff --git a/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp b/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
index a2daa19614d0a..508768270f8b0 100644
--- a/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
@@ -131,10 +131,9 @@ bool RISCVVMV0Elimination::runOnMachineFunction(MachineFunction &MF) {
 
           // Peek through a single copy to match what isel does.
           if (MachineInstr *SrcMI = MRI.getVRegDef(Src);
-              SrcMI->isCopy() && SrcMI->getOperand(1).getReg().isVirtual()) {
-            assert(SrcMI->getOperand(1).getSubReg() == RISCV::NoSubRegister);
+              SrcMI->isCopy() && SrcMI->getOperand(1).getReg().isVirtual() &&
+              SrcMI->getOperand(1).getSubReg() == RISCV::NoSubRegister)
             Src = SrcMI->getOperand(1).getReg();
-          }
 
           BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(RISCV::COPY), RISCV::V0)
               .addReg(Src);
diff --git a/llvm/test/CodeGen/RISCV/O0-pipeline.ll b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
index a50c303819f23..f93cb65897210 100644
--- a/llvm/test/CodeGen/RISCV/O0-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
@@ -39,11 +39,11 @@
 ; CHECK-NEXT:       RISC-V DAG->DAG Pattern Instruction Selection
 ; CHECK-NEXT:       Finalize ISel and expand pseudo-instructions
 ; CHECK-NEXT:       Local Stack Slot Allocation
-; CHECK-NEXT:       RISC-V VMV0 Elimination
 ; CHECK-NEXT:       RISC-V Pre-RA pseudo instruction expansion pass
 ; CHECK-NEXT:       RISC-V Insert Read/Write CSR Pass
 ; CHECK-NEXT:       RISC-V Insert Write VXRM Pass
 ; CHECK-NEXT:       RISC-V Landing Pad Setup
+; CHECK-NEXT:       RISC-V VMV0 Elimination
 ; CHECK-NEXT:       Init Undef Pass
 ; CHECK-NEXT:       Eliminate PHI nodes for register allocation
 ; CHECK-NEXT:       Two-Address instruction pass
diff --git a/llvm/test/CodeGen/RISCV/O3-pipeline.ll b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
index 2646dfeca4eb6..a2b5e9c86a107 100644
--- a/llvm/test/CodeGen/RISCV/O3-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
@@ -97,7 +97,6 @@
 ; CHECK-NEXT:       RISC-V DAG->DAG Pattern Instruction Selection
 ; CHECK-NEXT:       Finalize ISel and expand pseudo-instructions
 ; CHECK-NEXT:       RISC-V Vector Peephole Optimization
-; CHECK-NEXT:       RISC-V VMV0 Elimination
 ; CHECK-NEXT:       Lazy Machine Block Frequency Analysis
 ; CHECK-NEXT:       Early Tail Duplication
 ; CHECK-NEXT:       Optimize machine instruction PHIs
@@ -128,6 +127,7 @@
 ; CHECK-NEXT:       RISC-V Insert Read/Write CSR Pass
 ; CHECK-NEXT:       RISC-V Insert Write VXRM Pass
 ; CHECK-NEXT:       RISC-V Landing Pad Setup
+; CHECK-NEXT:       RISC-V VMV0 Elimination
 ; CHECK-NEXT:       Detect Dead Lanes
 ; CHECK-NEXT:       Init Undef Pass
 ; CHECK-NEXT:       Process Implicit Definitions
diff --git a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
index 1b9c78a20ec3b..039266b169ab2 100644
--- a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
 ; CHECK-NEXT:    vmv1r.v v0, v6
 ; CHECK-NEXT:    vsetvli zero, a2, e64, m8, ta, ma
 ; CHECK-NEXT:    vfabs.v v24, v16, v0.t
+; CHECK-NEXT:    addi a2, sp, 16
+; CHECK-NEXT:    vs8r.v v24, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT:    vl8r.v v24, (a2) # Unknown-size Folded Reload
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
 ; CHECK-NEXT:    vmflt.vf v6, v24, fa5, v0.t
 ; CHECK-NEXT:    fsrmi a2, 3
 ; CHECK-NEXT:    vmv1r.v v0, v6
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, ma
 ; CHECK-NEXT:    vfcvt.x.f.v v24, v16, v0.t
-; CHECK-NEXT:    addi a3, sp, 16
-; CHECK-NEXT:    vs8r.v v24, (a3) # Unknown-size Folded Spill
 ; CHECK-NEXT:    fsrm a2
-; CHECK-NEXT:    addi a2, sp, 16
-; CHECK-NEXT:    vl8r.v v24, (a2) # Unknown-size Folded Reload
 ; CHECK-NEXT:    vfcvt.f.x.v v24, v24, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
 ; CHECK-NEXT:    vfsgnj.vv v16, v24, v16, v0.t
-; CHECK-NEXT:    vs8r.v v16, (a2) # Unknown-size Folded Spill
 ; CHECK-NEXT:    bltu a0, a1, .LBB44_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    mv a0, a1
 ; CHECK-NEXT:  .LBB44_2:
 ; CHECK-NEXT:    vmv1r.v v0, v7
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m8, ta, ma
-; CHECK-NEXT:    vfabs.v v16, v8, v0.t
+; CHECK-NEXT:    vfabs.v v24, v8, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
-; CHECK-NEXT:    vmflt.vf v7, v16, fa5, v0.t
+; CHECK-NEXT:    vmflt.vf v7, v24, fa5, v0.t
 ; CHECK-NEXT:    fsrmi a0, 3
 ; CHECK-NEXT:    vmv1r.v v0, v7
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, ma
-; CHECK-NEXT:    vfcvt.x.f.v v16, v8, v0.t
+; CHECK-NEXT:    vfcvt.x.f.v v24, v8, v0.t
 ; CHECK-NEXT:    fsrm a0
-; CHECK-NEXT:    vfcvt.f.x.v v16, v16, v0.t
+; CHECK-NEXT:    vfcvt.f.x.v v24, v24, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m8, ta, mu
-; CHECK-NEXT:    vfsgnj.vv v8, v16, v8, v0.t
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT:    vfsgnj.vv v8, v24, v8, v0.t
 ; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add sp, sp, a0
diff --git a/llvm/test/CodeGen/RISCV/rvv/commutable.ll b/llvm/test/CodeGen/RISCV/rvv/commutable.ll
index e26c467f025bd..5f35626120178 100644
--- a/llvm/test/CodeGen/RISCV/rvv/commutable.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/commutable.ll
@@ -26,10 +26,9 @@ define <vscale x 1 x i64> @commutable_vadd_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vadd_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vadd.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vadd.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vadd.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -59,10 +58,9 @@ define <vscale x 1 x i64> @commutable_vand_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vand_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vand.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vand.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vand.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vand.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -92,10 +90,9 @@ define <vscale x 1 x i64> @commutable_vor_vv_masked(<vscale x 1 x i64> %0, <vsca
 ; CHECK-LABEL: commutable_vor_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vor.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vor.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -125,10 +122,9 @@ define <vscale x 1 x i64> @commutable_vxor_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vxor_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vxor.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vxor.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vxor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vxor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -158,10 +154,9 @@ define <vscale x 1 x i1> @commutable_vmseq_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmseq_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmseq.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmseq.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e8, mf8, ta, ma
-; CHECK-NEXT:    vmxor.mm v0, v10, v8
+; CHECK-NEXT:    vmxor.mm v0, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i1> @llvm.riscv.vmseq.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2)
   %b = call <vscale x 1 x i1> @llvm.riscv.vmseq.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2)
@@ -191,10 +186,9 @@ define <vscale x 1 x i1> @commutable_vmsne_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmsne_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmsne.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmsne.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e8, mf8, ta, ma
-; CHECK-NEXT:    vmxor.mm v0, v10, v8
+; CHECK-NEXT:    vmxor.mm v0, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i1> @llvm.riscv.vmsne.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2)
   %b = call <vscale x 1 x i1> @llvm.riscv.vmsne.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2)
@@ -224,10 +218,9 @@ define <vscale x 1 x i64> @commutable_vmin_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmin_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmin.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmin.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmin.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmin.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -257,10 +250,9 @@ define <vscale x 1 x i64> @commutable_vminu_vv_masked(<vscale x 1 x i64> %0, <vs
 ; CHECK-LABEL: commutable_vminu_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vminu.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vminu.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vminu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vminu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -290,10 +282,9 @@ define <vscale x 1 x i64> @commutable_vmax_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmax_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmax.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmax.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmax.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmax.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -323,10 +314,9 @@ define <vscale x 1 x i64> @commutable_vmaxu_vv_masked(<vscale x 1 x i64> %0, <vs
 ; CHECK-LABEL: commutable_vmaxu_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmaxu.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmaxu.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmaxu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmaxu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -356,10 +346,9 @@ define <vscale x 1 x i64> @commutable_vmul_vv_masked(<vscale x 1 x i64> %0, <vsc
 ; CHECK-LABEL: commutable_vmul_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmul.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmul.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmul.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmul.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -389,10 +378,9 @@ define <vscale x 1 x i64> @commutable_vmulh_vv_masked(<vscale x 1 x i64> %0, <vs
 ; CHECK-LABEL: commutable_vmulh_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmulh.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmulh.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmulh.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmulh.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -422,10 +410,9 @@ define <vscale x 1 x i64> @commutable_vmulhu_vv_masked(<vscale x 1 x i64> %0, <v
 ; CHECK-LABEL: commutable_vmulhu_vv_masked:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vmulhu.vv v10, v8, v9, v0.t
 ; CHECK-NEXT:    vmulhu.vv v8, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v8
+; CHECK-NEXT:    vadd.vv v8, v8, v8
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vmulhu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vmulhu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -456,9 +443,8 @@ define <vscale x 1 x i64> @commutable_vwadd_vv_masked(<vscale x 1 x i32> %0, <vs
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e32, mf2, ta, ma
 ; CHECK-NEXT:    vwadd.vv v10, v8, v9, v0.t
-; CHECK-NEXT:    vwadd.vv v11, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v11
+; CHECK-NEXT:    vadd.vv v8, v10, v10
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vwadd.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vwadd.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -489,9 +475,8 @@ define <vscale x 1 x i64> @commutable_vwaddu_vv_masked(<vscale x 1 x i32> %0, <v
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e32, mf2, ta, ma
 ; CHECK-NEXT:    vwaddu.vv v10, v8, v9, v0.t
-; CHECK-NEXT:    vwaddu.vv v11, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v11
+; CHECK-NEXT:    vadd.vv v8, v10, v10
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vwaddu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vwaddu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -522,9 +507,8 @@ define <vscale x 1 x i64> @commutable_vwmul_vv_masked(<vscale x 1 x i32> %0, <vs
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e32, mf2, ta, ma
 ; CHECK-NEXT:    vwmul.vv v10, v8, v9, v0.t
-; CHECK-NEXT:    vwmul.vv v11, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v11
+; CHECK-NEXT:    vadd.vv v8, v10, v10
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vwmul.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
   %b = call <vscale x 1 x i64> @llvm.riscv.vwmul.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -555,9 +539,8 @@ define <vscale x 1 x i64> @commutable_vwmulu_vv_masked(<vscale x 1 x i32> %0, <v
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli zero, a0, e32, mf2, ta, ma
 ; CHECK-NEXT:    vwmulu.vv v10, v8, v9, v0.t
-; CHECK-NEXT:    vwmulu.vv v11, v8, v9, v0.t
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vadd.vv v8, v10, v11
+; CHECK-NEXT:    vadd.vv v8, v10, v10
 ; CHECK-NEXT:    ret
   %a = call <vscale x 1 x i64> @llvm.riscv.vwmulu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscal...
[truncated]

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
Contributor

@wangpc-pp wangpc-pp Feb 12, 2025

Why would we reload the value right after the spill here?

Collaborator

Luke, have you had a chance to look into this? I don't know that this is a blocker, but this one is really weird. It appears to be the only reload from the stack slot - as in, we didn't need to spill anything here at all.

Contributor Author

Yeah it's the same thing as this case here, definitely very weird: #126850 (comment)

My main suspicion is that the physical COPYs to $v0 throw everything off, possibly because they assign a register hint. I had a previous idea to rejig the vmv0 elimination pass to emit COPYs to vmv0 around the uses instead of to $v0, and see if the register allocator likes that better. I'll give it another shot.

FWIW these test cases with the weird spilling are very artificial (LMUL 16 types) and I would pray we never see them in the real world.

Contributor

@wangpc-pp wangpc-pp left a comment

I totally agree we should go further in this direction! I just found some potential opportunities to optimize. They may not be related to this PR; this PR just uncovered them.

; CHECK-NEXT: sub sp, sp, a1
; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
Contributor

This is a regression.

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_round_nxv16f64(<vscale x 16 x double> %va, <vs
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Contributor

Reload right after spill.

Contributor Author

I looked into this but couldn't really figure out what the register allocator was doing. It's spilling %14 here:

288B	  $v0 = COPY %6:vr
304B	  %18:vr = COPY %6:vr
320B	  early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, $v0, %13:gprnox0, 6
336B	  %39:gpr = SwapFRMImm 4, implicit-def $frm, implicit $frm
352B	  $v0 = COPY %18:vr
368B	  %20:vrm8nov0 = nofpexcept PseudoVFCVT_X_F_V_M8_MASK undef %20:vrm8nov0(tied-def 0), %1:vrm8nov0, $v0, 4, %13:gprnox0, 6, 3, implicit $frm

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_roundeven_nxv16f64(<vscale x 16 x double> %va,
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Contributor

Ditto.

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_roundtozero_nxv16f64(<vscale x 16 x double> %v
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Contributor

Ditto.

@@ -429,16 +429,16 @@ define <vscale x 32 x bfloat> @vfadd_vv_nxv32bf16(<vscale x 32 x bfloat> %va, <v
; CHECK-NEXT: sltu a2, a0, a3
; CHECK-NEXT: addi a2, a2, -1
; CHECK-NEXT: and a2, a2, a3
; CHECK-NEXT: vmv4r.v v8, v16
; CHECK-NEXT: vmv8r.v v8, v16
Contributor

This is a dead move since v8 will be overridden at line 447 by vfwcvtbf16.f.f.v v8, v20, v0.t. Somehow we don't remove this move?

@lukel97
Contributor Author

lukel97 commented Feb 18, 2025

Ping

Collaborator

@preames preames left a comment

I'm a bit hesitant about moving forward with this one given the noted weird register allocation behavior in the tests. I see two paths forward:

  1. Performance data which shows this is strongly net positive.
  2. Investigating the regalloc cases, and fixing them in separate patches. In particular, the spill/reload with no need to actually spill seems... weird.

@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
Collaborator

Luke, have you had a chance to look into this? I don't know that this is a blocker, but this one is really weird. It appears to be the only reload from the stack slot - as in, we didn't need to spill anything here at all.

@lukel97
Contributor Author

lukel97 commented Feb 19, 2025

I see two paths forward:

  1. Performance data which shows this is strongly net positive.

I haven't been able to get any dynamic data yet, but I got some static results which line up with what I was expecting. There's a significant increase in the number of copies coalesced by MachineCSE:

Metric: machine-cse.NumCoalesces

Program            machine-cse.NumCoalesces              
                   lhs                      rhs     diff 
     538.imagick_r 1266.00                  2137.00 68.8%
     638.imagick_s 1266.00                  2137.00 68.8%
   631.deepsjeng_s  158.00                   230.00 45.6%
   531.deepsjeng_r  158.00                   230.00 45.6%
         519.lbm_r    3.00                     4.00 33.3%
         619.lbm_s    3.00                     4.00 33.3%
      511.povray_r  690.00                   865.00 25.4%
     526.blender_r 7891.00                  8539.00  8.2%
        625.x264_s 1732.00                  1828.00  5.5%
        525.x264_r 1732.00                  1828.00  5.5%
     520.omnetpp_r  614.00                   640.00  4.2%
     620.omnetpp_s  614.00                   640.00  4.2%
         544.nab_r  164.00                   168.00  2.4%
         644.nab_s  164.00                   168.00  2.4%
       641.leela_s  283.00                   289.00  2.1%
       541.leela_r  283.00                   289.00  2.1%
   500.perlbench_r  996.00                  1012.00  1.6%
   600.perlbench_s  996.00                  1012.00  1.6%
         602.gcc_s 5373.00                  5443.00  1.3%
         502.gcc_r 5373.00                  5443.00  1.3%
      510.parest_r 9374.00                  9484.00  1.2%
   623.xalancbmk_s 1861.00                  1873.00  0.6%
   523.xalancbmk_r 1861.00                  1873.00  0.6%
        508.namd_r 2245.00                  2246.00  0.0%
         605.mcf_s   34.00                    34.00  0.0%
          557.xz_r  181.00                   181.00  0.0%
         505.mcf_r   34.00                    34.00  0.0%
          657.xz_s  181.00                   181.00  0.0%
Geomean difference                                  11.5%

povray/imagick/blender have more vector sequences sunk into if/else blocks:

Metric: machine-sink.NumSunk

Program            machine-sink.NumSunk                
                   lhs                  rhs       diff 
      511.povray_r   8836.00              9149.00  3.5%
     538.imagick_r  30541.00             30631.00  0.3%
     638.imagick_s  30541.00             30631.00  0.3%
     526.blender_r 119273.00            119323.00  0.0%
    ...
Geomean difference                                 0.1%

MachineLICM does more hoisting:

Program                                       machinelicm.NumHoisted                machinelicm.NumCSEed               
                                              lhs                    rhs      diff  lhs                  rhs      diff 
FP2017rate/519.lbm_r/519.lbm_r                  275.00                 277.00  0.7%    28.00                28.00  0.0%
FP2017speed/619.lbm_s/619.lbm_s                 280.00                 282.00  0.7%    32.00                32.00  0.0%
FP2017speed/638.imagick_s/638.imagick_s       21459.00               21569.00  0.5% 10250.00             10331.00  0.8%
FP2017rate/538.imagick_r/538.imagick_r        21459.00               21569.00  0.5% 10250.00             10331.00  0.8%
FP2017rate/526.blender_r/526.blender_r        76896.00               76973.00  0.1% 30323.00             30343.00  0.1%
INT2017rate/525.x264_r/525.x264_r              2953.00                2955.00  0.1%  1179.00              1179.00  0.0%
INT2017speed/625.x264_s/625.x264_s             2953.00                2955.00  0.1%  1179.00              1179.00  0.0%
FP2017rate/510.parest_r/510.parest_r          65432.00               65464.00  0.0% 28134.00             28137.00  0.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s    5670.00                5670.00  0.0%  2019.00              2019.00  0.0%
INT2017speed/602.gcc_s/602.gcc_s              61200.00               61200.00  0.0% 26398.00             26398.00  0.0%
INT2017speed/605.mcf_s/605.mcf_s                244.00                 244.00  0.0%   107.00               107.00  0.0%
FP2017rate/508.namd_r/508.namd_r               2455.00                2455.00  0.0%  1384.00              1384.00  0.0%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s    9640.00                9640.00  0.0%  3570.00              3570.00  0.0%
INT2017rate/557.xz_r/557.xz_r                  1034.00                1034.00  0.0%   345.00               345.00  0.0%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s    1305.00                1305.00  0.0%   426.00               426.00  0.0%
                           Geomean difference                                  0.1%                                0.1%

There's negligible impact on spills/reloads:

Program                                       regalloc.NumSpills                regalloc.NumReloads               
                                              lhs                rhs      diff  lhs                 rhs      diff 
FP2017rate/538.imagick_r/538.imagick_r         4124.00            4125.00  0.0% 10418.00            10396.00 -0.2%
FP2017speed/638.imagick_s/638.imagick_s        4124.00            4125.00  0.0% 10418.00            10396.00 -0.2%
FP2017rate/526.blender_r/526.blender_r        13483.00           13484.00  0.0% 27498.00            27499.00  0.0%
FP2017rate/508.namd_r/508.namd_r               6712.00            6712.00  0.0% 16664.00            16664.00  0.0%
INT2017rate/525.x264_r/525.x264_r              2214.00            2214.00  0.0%  4630.00             4630.00  0.0%
INT2017speed/641.leela_s/641.leela_s            319.00             319.00  0.0%   484.00              484.00  0.0%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s     355.00             355.00  0.0%   711.00              711.00  0.0%
INT2017speed/625.x264_s/625.x264_s             2214.00            2214.00  0.0%  4630.00             4630.00  0.0%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s    1816.00            1816.00  0.0%  2982.00             2982.00  0.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s     721.00             721.00  0.0%  1212.00             1212.00  0.0%
INT2017speed/605.mcf_s/605.mcf_s                124.00             124.00  0.0%   373.00              373.00  0.0%
INT2017spe...00.perlbench_s/600.perlbench_s    4389.00            4389.00  0.0%  9753.00             9753.00  0.0%
INT2017rate/557.xz_r/557.xz_r                   317.00             317.00  0.0%   611.00              611.00  0.0%
INT2017rate/541.leela_r/541.leela_r             319.00             319.00  0.0%   484.00              484.00  0.0%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r     355.00             355.00  0.0%   711.00              711.00  0.0%
                           Geomean difference                             -0.0%                              -0.0%

Investigating the regalloc cases, and fixing them in separate patches. In particular, the spill/reload with no need to actually spill seems... weird.

I was able to fix the spills (in this diff) by changing the $V0 COPYs to VMV0 COPYs, but with the big caveat that it required fixing the earlyclobber constraint we have on pseudos, which is needed for the register group overlap rules. At the risk of going down a rabbit hole here I'll attach the patch. I'm not sure if it will be worth pursuing, but it's possible, anyway:
earlyclobber.diff.txt

Contributor

@wangpc-pp wangpc-pp left a comment

LGTM.
I'm inclined to move this forward, as these results seem attractive. :-)

@preames
Collaborator

preames commented Feb 19, 2025

LGTM, remember to rebase.

This is the follow up to llvm#125026 that keeps mask operands in virtual register form for as long as possible throughout the backend.

The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer kicking in.

The invariant that the mask COPY never has a subreg no longer holds after MachineCSE (it coalesces some copies), so it needed to be relaxed.
@lukel97 lukel97 force-pushed the move-vmv0-elimination-past-ssa-opt branch from 1669f34 to 2c41445 on February 20, 2025 04:15
@lukel97 lukel97 merged commit df96b56 into llvm:main Feb 20, 2025
5 of 7 checks passed
@newgre

newgre commented Feb 26, 2025

This change seems to cause an ICE when compiling the following code:

char d[];
int f;
void sigsetjmp();
#define a(b, c) sigsetjmp()
void e() {
  a(, );
  f = 0;
  for (; f < 7; f++) d[f] = f;
}

using this compile script:

#!/bin/bash

clang \
'-mtune=sifive-x280' \
-target \
riscv64-unknown-linux-musl \
'-march=rv64gcv1p0_zfh_zvfh_xsfvcp' \
-w \
-Os \
-c \
reduce_me.c \
-o \
/tmp/reduce_me.o

Could we roll back this change until this is fixed? Please do let me know if you need more info or if you have trouble reproducing.

@lukel97
Contributor Author

lukel97 commented Feb 26, 2025

@newgre Thanks for the minimal reproducer, I think I'll be able to commit a fix for this soon. The issue is to do with dead COPYs being left around. I would have thought that they would have been mopped up by another pass but it looks like when we've moved it this close to regalloc we have to do it ourselves.
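
To illustrate the failure mode (a hypothetical MIR sketch based on the description above and on the earlier dumps in this thread, not the actual MIR from the reproducer; the register numbers and the %mask name are invented): after the pass peeks through the mask COPY and inserts its own copy to $v0, the original vmv0 COPY can be left with no users, e.g.

  %mask:vmv0 = COPY %6:vr
  $v0 = COPY %6:vr
  early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, $v0, %13:gprnox0, 6

The first COPY is now dead, but since no pass between vmv0 elimination and regalloc cleans it up, the register allocator still tries to assign something from the single-register vmv0 class to it.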

@newgre

newgre commented Feb 26, 2025

Thanks for the quick response.
Would you mind rolling back per https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy, as this breaks one of our internal toolchains? Otherwise we will soon be in a situation where crucial build jobs become non-functional, and we'd really like to avoid that.

@preames
Collaborator

preames commented Feb 26, 2025

Thanks for the quick response. Would you mind rolling back per https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy, as this breaks one of our internal toolchains? Otherwise we will soon be in a situation where crucial build jobs become non-functional, and we'd really like to avoid that.

Waiting a week to report a problem and then asking for an immediate revert instead of a prompt fix is not reasonable. Luke, please do revert if the fix turns out to be non-trivial, but it's up to you to make a judgement call here.

lukel97 added a commit that referenced this pull request Feb 26, 2025
This fixes a crash reported at
#126850 (comment),
where we may leave around a COPY to vmv0 after peeking through it.
Even though the COPY is dead, there's no pass between vmv0 elimination
and regalloc that will delete it so regalloc will try to allocate
something for it.

The test showcasing this is added in vmv0-elimination.mir. Removing
the dead COPY results in changes in spills in the >= LMUL 16 VP tests,
but it's worth noting that these tests are very noisy and not
representative of real world code.
@lukel97
Contributor Author

lukel97 commented Feb 26, 2025

Thanks for the quick response. Would you mind rolling back per https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy, as this breaks one of our internal toolchains? Otherwise we will soon be in a situation where crucial build jobs become non-functional, and we'd really like to avoid that.

I've pushed a fix at 35bf925; can you try it out and let me know if that fixes the issue in your internal toolchain?

Given that the fix was pretty small, I wanted to avoid reverting, since the diff in this PR is quite large and has been in tree for nearly a week now. I hope that makes sense.

@newgre

newgre commented Feb 26, 2025

I've rebuilt the compiler and re-ran the test, and with your fix included the ICE goes away. Thanks!

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Feb 26, 2025
This fixes a crash reported at
llvm/llvm-project#126850 (comment),
where we may leave around a COPY to vmv0 after peeking through it.
Even though the COPY is dead, there's no pass between vmv0 elimination
and regalloc that will delete it so regalloc will try to allocate
something for it.

The test showcasing this is added in vmv0-elimination.mir. Removing
the dead COPY results in changes in spills in the >= LMUL 16 VP tests,
but it's worth noting that these tests are very noisy and not
representative of real world code.