[RISCV] Move VMV0 elimination past machine SSA opts #126850
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

This is the follow-up to #125026 that keeps mask operands in virtual register form for as long as possible throughout the backend. The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer kicking in. The invariant that the mask COPY never has a subreg no longer holds after MachineCSE (it coalesces some copies), so it needed to be relaxed.

Patch is 172.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126850.diff

35 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 167dbb53c5950..b2b77584c414d 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -588,8 +588,6 @@ void RISCVPassConfig::addPreEmitPass2() {
void RISCVPassConfig::addMachineSSAOptimization() {
addPass(createRISCVVectorPeepholePass());
- // TODO: Move this to pre regalloc
- addPass(createRISCVVMV0EliminationPass());
TargetPassConfig::addMachineSSAOptimization();
@@ -602,10 +600,6 @@ void RISCVPassConfig::addMachineSSAOptimization() {
}
void RISCVPassConfig::addPreRegAlloc() {
- // TODO: Move this as late as possible before regalloc
- if (TM->getOptLevel() == CodeGenOptLevel::None)
- addPass(createRISCVVMV0EliminationPass());
-
addPass(createRISCVPreRAExpandPseudoPass());
if (TM->getOptLevel() != CodeGenOptLevel::None) {
addPass(createRISCVMergeBaseOffsetOptPass());
@@ -619,6 +613,8 @@ void RISCVPassConfig::addPreRegAlloc() {
if (TM->getOptLevel() != CodeGenOptLevel::None && EnableMachinePipeliner)
addPass(&MachinePipelinerID);
+
+ addPass(createRISCVVMV0EliminationPass());
}
void RISCVPassConfig::addFastRegAlloc() {
diff --git a/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp b/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
index a2daa19614d0a..508768270f8b0 100644
--- a/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVMV0Elimination.cpp
@@ -131,10 +131,9 @@ bool RISCVVMV0Elimination::runOnMachineFunction(MachineFunction &MF) {
// Peek through a single copy to match what isel does.
if (MachineInstr *SrcMI = MRI.getVRegDef(Src);
- SrcMI->isCopy() && SrcMI->getOperand(1).getReg().isVirtual()) {
- assert(SrcMI->getOperand(1).getSubReg() == RISCV::NoSubRegister);
+ SrcMI->isCopy() && SrcMI->getOperand(1).getReg().isVirtual() &&
+ SrcMI->getOperand(1).getSubReg() == RISCV::NoSubRegister)
Src = SrcMI->getOperand(1).getReg();
- }
BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(RISCV::COPY), RISCV::V0)
.addReg(Src);
diff --git a/llvm/test/CodeGen/RISCV/O0-pipeline.ll b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
index a50c303819f23..f93cb65897210 100644
--- a/llvm/test/CodeGen/RISCV/O0-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
@@ -39,11 +39,11 @@
; CHECK-NEXT: RISC-V DAG->DAG Pattern Instruction Selection
; CHECK-NEXT: Finalize ISel and expand pseudo-instructions
; CHECK-NEXT: Local Stack Slot Allocation
-; CHECK-NEXT: RISC-V VMV0 Elimination
; CHECK-NEXT: RISC-V Pre-RA pseudo instruction expansion pass
; CHECK-NEXT: RISC-V Insert Read/Write CSR Pass
; CHECK-NEXT: RISC-V Insert Write VXRM Pass
; CHECK-NEXT: RISC-V Landing Pad Setup
+; CHECK-NEXT: RISC-V VMV0 Elimination
; CHECK-NEXT: Init Undef Pass
; CHECK-NEXT: Eliminate PHI nodes for register allocation
; CHECK-NEXT: Two-Address instruction pass
diff --git a/llvm/test/CodeGen/RISCV/O3-pipeline.ll b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
index 2646dfeca4eb6..a2b5e9c86a107 100644
--- a/llvm/test/CodeGen/RISCV/O3-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
@@ -97,7 +97,6 @@
; CHECK-NEXT: RISC-V DAG->DAG Pattern Instruction Selection
; CHECK-NEXT: Finalize ISel and expand pseudo-instructions
; CHECK-NEXT: RISC-V Vector Peephole Optimization
-; CHECK-NEXT: RISC-V VMV0 Elimination
; CHECK-NEXT: Lazy Machine Block Frequency Analysis
; CHECK-NEXT: Early Tail Duplication
; CHECK-NEXT: Optimize machine instruction PHIs
@@ -128,6 +127,7 @@
; CHECK-NEXT: RISC-V Insert Read/Write CSR Pass
; CHECK-NEXT: RISC-V Insert Write VXRM Pass
; CHECK-NEXT: RISC-V Landing Pad Setup
+; CHECK-NEXT: RISC-V VMV0 Elimination
; CHECK-NEXT: Detect Dead Lanes
; CHECK-NEXT: Init Undef Pass
; CHECK-NEXT: Process Implicit Definitions
diff --git a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
index 1b9c78a20ec3b..039266b169ab2 100644
--- a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
+; CHECK-NEXT: addi a2, sp, 16
+; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
; CHECK-NEXT: vsetvli zero, zero, e64, m8, ta, mu
; CHECK-NEXT: vmflt.vf v6, v24, fa5, v0.t
; CHECK-NEXT: fsrmi a2, 3
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, zero, e64, m8, ta, ma
; CHECK-NEXT: vfcvt.x.f.v v24, v16, v0.t
-; CHECK-NEXT: addi a3, sp, 16
-; CHECK-NEXT: vs8r.v v24, (a3) # Unknown-size Folded Spill
; CHECK-NEXT: fsrm a2
-; CHECK-NEXT: addi a2, sp, 16
-; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
; CHECK-NEXT: vfcvt.f.x.v v24, v24, v0.t
; CHECK-NEXT: vsetvli zero, zero, e64, m8, ta, mu
; CHECK-NEXT: vfsgnj.vv v16, v24, v16, v0.t
-; CHECK-NEXT: vs8r.v v16, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: bltu a0, a1, .LBB44_2
; CHECK-NEXT: # %bb.1:
; CHECK-NEXT: mv a0, a1
; CHECK-NEXT: .LBB44_2:
; CHECK-NEXT: vmv1r.v v0, v7
; CHECK-NEXT: vsetvli zero, a0, e64, m8, ta, ma
-; CHECK-NEXT: vfabs.v v16, v8, v0.t
+; CHECK-NEXT: vfabs.v v24, v8, v0.t
; CHECK-NEXT: vsetvli zero, zero, e64, m8, ta, mu
-; CHECK-NEXT: vmflt.vf v7, v16, fa5, v0.t
+; CHECK-NEXT: vmflt.vf v7, v24, fa5, v0.t
; CHECK-NEXT: fsrmi a0, 3
; CHECK-NEXT: vmv1r.v v0, v7
; CHECK-NEXT: vsetvli zero, zero, e64, m8, ta, ma
-; CHECK-NEXT: vfcvt.x.f.v v16, v8, v0.t
+; CHECK-NEXT: vfcvt.x.f.v v24, v8, v0.t
; CHECK-NEXT: fsrm a0
-; CHECK-NEXT: vfcvt.f.x.v v16, v16, v0.t
+; CHECK-NEXT: vfcvt.f.x.v v24, v24, v0.t
; CHECK-NEXT: vsetvli zero, zero, e64, m8, ta, mu
-; CHECK-NEXT: vfsgnj.vv v8, v16, v8, v0.t
-; CHECK-NEXT: addi a0, sp, 16
-; CHECK-NEXT: vl8r.v v16, (a0) # Unknown-size Folded Reload
+; CHECK-NEXT: vfsgnj.vv v8, v24, v8, v0.t
; CHECK-NEXT: csrr a0, vlenb
; CHECK-NEXT: slli a0, a0, 3
; CHECK-NEXT: add sp, sp, a0
diff --git a/llvm/test/CodeGen/RISCV/rvv/commutable.ll b/llvm/test/CodeGen/RISCV/rvv/commutable.ll
index e26c467f025bd..5f35626120178 100644
--- a/llvm/test/CodeGen/RISCV/rvv/commutable.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/commutable.ll
@@ -26,10 +26,9 @@ define <vscale x 1 x i64> @commutable_vadd_vv_masked(<vscale x 1 x i64> %0, <vsc
; CHECK-LABEL: commutable_vadd_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v10, v8, v9, v0.t
; CHECK-NEXT: vadd.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vadd.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vadd.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -59,10 +58,9 @@ define <vscale x 1 x i64> @commutable_vand_vv_masked(<vscale x 1 x i64> %0, <vsc
; CHECK-LABEL: commutable_vand_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vand.vv v10, v8, v9, v0.t
; CHECK-NEXT: vand.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vand.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vand.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -92,10 +90,9 @@ define <vscale x 1 x i64> @commutable_vor_vv_masked(<vscale x 1 x i64> %0, <vsca
; CHECK-LABEL: commutable_vor_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vor.vv v10, v8, v9, v0.t
; CHECK-NEXT: vor.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -125,10 +122,9 @@ define <vscale x 1 x i64> @commutable_vxor_vv_masked(<vscale x 1 x i64> %0, <vsc
; CHECK-LABEL: commutable_vxor_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vxor.vv v10, v8, v9, v0.t
; CHECK-NEXT: vxor.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vxor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vxor.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -158,10 +154,9 @@ define <vscale x 1 x i1> @commutable_vmseq_vv_masked(<vscale x 1 x i64> %0, <vsc
; CHECK-LABEL: commutable_vmseq_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vmseq.vv v10, v8, v9, v0.t
; CHECK-NEXT: vmseq.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e8, mf8, ta, ma
-; CHECK-NEXT: vmxor.mm v0, v10, v8
+; CHECK-NEXT: vmxor.mm v0, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i1> @llvm.riscv.vmseq.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2)
%b = call <vscale x 1 x i1> @llvm.riscv.vmseq.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2)
@@ -191,10 +186,9 @@ define <vscale x 1 x i1> @commutable_vmsne_vv_masked(<vscale x 1 x i64> %0, <vsc
; CHECK-LABEL: commutable_vmsne_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vmsne.vv v10, v8, v9, v0.t
; CHECK-NEXT: vmsne.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e8, mf8, ta, ma
-; CHECK-NEXT: vmxor.mm v0, v10, v8
+; CHECK-NEXT: vmxor.mm v0, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i1> @llvm.riscv.vmsne.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2)
%b = call <vscale x 1 x i1> @llvm.riscv.vmsne.mask.nxv1i64(<vscale x 1 x i1> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2)
@@ -224,10 +218,9 @@ define <vscale x 1 x i64> @commutable_vmin_vv_masked(<vscale x 1 x i64> %0, <vsc
; CHECK-LABEL: commutable_vmin_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vmin.vv v10, v8, v9, v0.t
; CHECK-NEXT: vmin.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vmin.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vmin.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -257,10 +250,9 @@ define <vscale x 1 x i64> @commutable_vminu_vv_masked(<vscale x 1 x i64> %0, <vs
; CHECK-LABEL: commutable_vminu_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vminu.vv v10, v8, v9, v0.t
; CHECK-NEXT: vminu.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vminu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vminu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -290,10 +282,9 @@ define <vscale x 1 x i64> @commutable_vmax_vv_masked(<vscale x 1 x i64> %0, <vsc
; CHECK-LABEL: commutable_vmax_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vmax.vv v10, v8, v9, v0.t
; CHECK-NEXT: vmax.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vmax.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vmax.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -323,10 +314,9 @@ define <vscale x 1 x i64> @commutable_vmaxu_vv_masked(<vscale x 1 x i64> %0, <vs
; CHECK-LABEL: commutable_vmaxu_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vmaxu.vv v10, v8, v9, v0.t
; CHECK-NEXT: vmaxu.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vmaxu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vmaxu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -356,10 +346,9 @@ define <vscale x 1 x i64> @commutable_vmul_vv_masked(<vscale x 1 x i64> %0, <vsc
; CHECK-LABEL: commutable_vmul_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vmul.vv v10, v8, v9, v0.t
; CHECK-NEXT: vmul.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vmul.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vmul.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -389,10 +378,9 @@ define <vscale x 1 x i64> @commutable_vmulh_vv_masked(<vscale x 1 x i64> %0, <vs
; CHECK-LABEL: commutable_vmulh_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vmulh.vv v10, v8, v9, v0.t
; CHECK-NEXT: vmulh.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vmulh.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vmulh.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -422,10 +410,9 @@ define <vscale x 1 x i64> @commutable_vmulhu_vv_masked(<vscale x 1 x i64> %0, <v
; CHECK-LABEL: commutable_vmulhu_vv_masked:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vmulhu.vv v10, v8, v9, v0.t
; CHECK-NEXT: vmulhu.vv v8, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v8
+; CHECK-NEXT: vadd.vv v8, v8, v8
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vmulhu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %0, <vscale x 1 x i64> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vmulhu.mask.nxv1i64.nxv1i64(<vscale x 1 x i64> undef, <vscale x 1 x i64> %1, <vscale x 1 x i64> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -456,9 +443,8 @@ define <vscale x 1 x i64> @commutable_vwadd_vv_masked(<vscale x 1 x i32> %0, <vs
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
; CHECK-NEXT: vwadd.vv v10, v8, v9, v0.t
-; CHECK-NEXT: vwadd.vv v11, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v11
+; CHECK-NEXT: vadd.vv v8, v10, v10
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vwadd.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vwadd.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -489,9 +475,8 @@ define <vscale x 1 x i64> @commutable_vwaddu_vv_masked(<vscale x 1 x i32> %0, <v
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
; CHECK-NEXT: vwaddu.vv v10, v8, v9, v0.t
-; CHECK-NEXT: vwaddu.vv v11, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v11
+; CHECK-NEXT: vadd.vv v8, v10, v10
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vwaddu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vwaddu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -522,9 +507,8 @@ define <vscale x 1 x i64> @commutable_vwmul_vv_masked(<vscale x 1 x i32> %0, <vs
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
; CHECK-NEXT: vwmul.vv v10, v8, v9, v0.t
-; CHECK-NEXT: vwmul.vv v11, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v11
+; CHECK-NEXT: vadd.vv v8, v10, v10
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vwmul.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
%b = call <vscale x 1 x i64> @llvm.riscv.vwmul.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %1, <vscale x 1 x i32> %0, <vscale x 1 x i1> %mask, iXLen %2, iXLen 1)
@@ -555,9 +539,8 @@ define <vscale x 1 x i64> @commutable_vwmulu_vv_masked(<vscale x 1 x i32> %0, <v
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli zero, a0, e32, mf2, ta, ma
; CHECK-NEXT: vwmulu.vv v10, v8, v9, v0.t
-; CHECK-NEXT: vwmulu.vv v11, v8, v9, v0.t
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vadd.vv v8, v10, v11
+; CHECK-NEXT: vadd.vv v8, v10, v10
; CHECK-NEXT: ret
%a = call <vscale x 1 x i64> @llvm.riscv.vwmulu.mask.nxv1i64.nxv1i32.nxv1i32(<vscale x 1 x i64> undef, <vscale x 1 x i32> %0, <vscale x 1 x i32> %1, <vscal...
[truncated]
@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
Why would we reload the value right after the spill here?
Luke, have you had a chance to look into this? I don't know that this is a blocker, but this one is really weird. It appears to be the only reload from the stack slot - as in, we didn't need to spill anything here at all.
Yeah it's the same thing as this case here, definitely very weird: #126850 (comment)
My main suspicion is that the physical COPYs to $v0 throw everything off, possibly due to it assigning a register hint. I had a previous idea to rejig the vmv0 elimination pass to instead emit COPYs to vmv0 around the uses, instead of $v0, and see if the register allocator likes that better. I'll give it another shot.
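Roughly, as a sketch (the register numbers and operand lists are illustrative, modelled on the MIR dump further down, not taken from any actual patch): today the pass rewrites masked pseudos to read the physical register,

$v0 = COPY %6:vr
%20:vrm8nov0 = PseudoVFCVT_X_F_V_M8_MASK undef %20:vrm8nov0(tied-def 0), %1:vrm8nov0, $v0, 4, %13:gprnox0, 6, 3, implicit $frm

whereas the idea would be to keep the mask in a vmv0-constrained virtual register and leave the assignment of $v0 to the register allocator:

%40:vmv0 = COPY %6:vr
%20:vrm8nov0 = PseudoVFCVT_X_F_V_M8_MASK undef %20:vrm8nov0(tied-def 0), %1:vrm8nov0, %40:vmv0, 4, %13:gprnox0, 6, 3, implicit $frm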
FWIW these test cases with the weird spilling are very artificial (LMUL 16 types) and I would pray we never see them in the real world.
I totally agree we should go further in this direction! I just found some potential opportunities to optimize. They may not be related to this PR, but this PR uncovered them.
; CHECK-NEXT: sub sp, sp, a1
; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 8 * vlenb
; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb |
This is a regression.
@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_round_nxv16f64(<vscale x 16 x double> %va, <vs
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Reload right after spill.
I looked into this but couldn't really figure out what the register allocator was doing. It's spilling the %14 here
288B $v0 = COPY %6:vr
304B %18:vr = COPY %6:vr
320B early-clobber %18:vr = nofpexcept PseudoVMFLT_VFPR64_M8_MASK %18:vr(tied-def 0), %14:vrm8nov0, %17:fpr64, $v0, %13:gprnox0, 6
336B %39:gpr = SwapFRMImm 4, implicit-def $frm, implicit $frm
352B $v0 = COPY %18:vr
368B %20:vrm8nov0 = nofpexcept PseudoVFCVT_X_F_V_M8_MASK undef %20:vrm8nov0(tied-def 0), %1:vrm8nov0, $v0, 4, %13:gprnox0, 6, 3, implicit $frm
@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_roundeven_nxv16f64(<vscale x 16 x double> %va,
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Ditto.
@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_roundtozero_nxv16f64(<vscale x 16 x double> %v
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
; CHECK-NEXT: vl8r.v v24, (a2) # Unknown-size Folded Reload
Ditto.
@@ -429,16 +429,16 @@ define <vscale x 32 x bfloat> @vfadd_vv_nxv32bf16(<vscale x 32 x bfloat> %va, <v
; CHECK-NEXT: sltu a2, a0, a3
; CHECK-NEXT: addi a2, a2, -1
; CHECK-NEXT: and a2, a2, a3
; CHECK-NEXT: vmv4r.v v8, v16
; CHECK-NEXT: vmv8r.v v8, v16 |
This is a dead move, since v8 will be overwritten at line 447 by vfwcvtbf16.f.f.v v8, v20, v0.t. Somehow we don't remove this move?
Ping
I'm a bit hesitant about moving forward with this one given the noted weird register allocation behavior in the tests. I see two paths forward:
- Performance data which shows this is strongly net positive.
- Investigating the regalloc cases, and fixing them in separate patches. In particular, the spill/reload with no need to actually spill seems... weird.
@@ -1515,40 +1515,36 @@ define <vscale x 16 x double> @vp_ceil_vv_nxv16f64(<vscale x 16 x double> %va, <
; CHECK-NEXT: vmv1r.v v0, v6
; CHECK-NEXT: vsetvli zero, a2, e64, m8, ta, ma
; CHECK-NEXT: vfabs.v v24, v16, v0.t
; CHECK-NEXT: addi a2, sp, 16
; CHECK-NEXT: vs8r.v v24, (a2) # Unknown-size Folded Spill
> Luke, have you had a chance to look into this? I don't know that this is a blocker, but this one is really weird. It appears to be the only reload from the stack slot - as in, we didn't need to spill anything here at all.
I haven't been able to get any dynamic data yet, but I got some static results which line up with what I was expecting. There's a significant increase in the number of copies coalesced by MachineCSE:
povray/imagick/blender have more vector sequences shrunk into if/else blocks:
MachineLICM does more hoisting:
There's negligible impact on spills/reloads:
I was able to fix the spills (in this diff) by changing the $V0 COPYs to VMV0 COPYs, but with the big caveat that it required fixing the earlyclobber constraint we have on pseudos required for register group overlap rules. At the risk of going down a rabbit hole here I'll attach the patch. I'm not sure if it will be worth pursuing, but it's possible anyway:
LGTM.
I'm inclined to move this forward, as these results seem attractive. :-)
LGTM, remember to rebase.
This change seems to cause an ICE when compiling the following code:

char d[];
int f;
void sigsetjmp();
#define a(b, c) sigsetjmp()
void e() {
  a(, );
  f = 0;
  for (; f < 7; f++) d[f] = f;
}

using this compile script:

#!/bin/bash
clang \
  '-mtune=sifive-x280' \
  -target \
  riscv64-unknown-linux-musl \
  '-march=rv64gcv1p0_zfh_zvfh_xsfvcp' \
  -w \
  -Os \
  -c \
  reduce_me.c \
  -o \
  /tmp/reduce_me.o

Could we roll back this change until this is fixed? Please let me know if you need more info or if you have trouble reproducing it.
@newgre Thanks for the minimal reproducer, I think I'll be able to commit a fix for this soon. The issue is to do with dead COPYs being left around. I would have thought that they would have been mopped up by another pass, but it looks like when we've moved it this close to regalloc we have to do it ourselves.
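For the shape of it (just a sketch of the idea, not the actual code in the fix; it reuses the names from the snippet in the diff above): since the pass already peeks through a COPY to find the real mask source, it can erase that COPY afterwards if nothing else reads its result, because no pass between vmv0 elimination and regalloc will do that cleanup anymore.

// Peek through a single copy to match what isel does, remembering the
// intermediate vreg so the COPY can be cleaned up below.
Register OldSrc = Src;
if (MachineInstr *SrcMI = MRI.getVRegDef(Src);
    SrcMI->isCopy() && SrcMI->getOperand(1).getReg().isVirtual() &&
    SrcMI->getOperand(1).getSubReg() == RISCV::NoSubRegister)
  Src = SrcMI->getOperand(1).getReg();

// ... mask operand rewritten to read $v0 (copied from Src) as before ...

// If peeking through the COPY left it dead, erase it here: nothing between
// this pass and regalloc removes dead copies, so regalloc would otherwise
// still try to assign a register for its result.
if (OldSrc != Src && MRI.use_nodbg_empty(OldSrc))
  MRI.getVRegDef(OldSrc)->eraseFromParent();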
Thanks for the quick response.
Waiting a week to report a problem, and then asking for an immediate revert instead of a prompt fix, is not reasonable. Luke, please do revert if the fix turns out to be non-trivial, but it's up to you to make a judgement call here.
This fixes a crash reported at #126850 (comment), where we may leave around a COPY to vmv0 after peeking through it. Even though the COPY is dead, there's no pass between vmv0 elimination and regalloc that will delete it so regalloc will try to allocate something for it. The test showcasing this is added in vmv0-elimination.mir. Removing the dead COPY results in changes in spills in the >= LMUL 16 VP tests, but it's worth noting that these tests are very noisy and not representative of real world code.
I've pushed a fix at 35bf925; can you try it out and let me know if it fixes the issue in your internal toolchain? Given that the fix was pretty small, I wanted to avoid reverting, since the diff in this PR is quite large and has been in tree for nearly a week now. I hope that makes sense.
I've rebuilt the compiler and re-run the test; with your fix included, the ICE goes away. Thanks!
This is the follow-up to #125026 that keeps mask operands in virtual register form for as long as possible throughout the backend.
The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer kicking in.
The invariant that the mask COPY never has a subreg no longer holds after MachineCSE (it coalesces some copies), so it needed to be relaxed.