[NVPTX] Prefer prmt.b32 over bfi.b32 #110766

justinfargnoli · 2024-10-01T23:49:01Z

In [NVPTX] Improve lowering of v4i8 @Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX instructions. @Artem-B did this because: (source)

Under the hood byte extraction/insertion ends up as BFI/BFE instructions, so we may as well do that in PTX, too. https://godbolt.org/z/Tb3zWbj9b

However, the example that @Artem-B linked was targeting sm_52. On modern architectures, ptxas uses prmt.b32. Example.

Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.

llvmbot · 2024-10-01T23:49:34Z

@llvm/pr-subscribers-backend-nvptx

Author: Justin Fargnoli (justinfargnoli)

Changes

In [NVPTX] Improve lowering of v4i8 @Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX instructions. @Artem-B did this because: (source)

Under the hood byte extraction/insertion ends up as BFI/BFE instructions, so we may as well do that in PTX, too. https://godbolt.org/z/Tb3zWbj9b

However, the example that @Artem-B linked was targeting sm_52. On modern architectures, ptxas uses prmt.b32. Example.
Thus, remove all uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.

Patch is 41.30 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/110766.diff

3 Files Affected:

(modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+17-14)
(modified) llvm/test/CodeGen/NVPTX/i8x4-instructions.ll (+191-191)
(modified) llvm/test/CodeGen/NVPTX/sext-setcc.ll (+6-6)

diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 8812136733fb24..b6fc8eff56d6ea 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -2328,20 +2328,23 @@ SDValue NVPTXTargetLowering::LowerBUILD_VECTOR(SDValue Op,
     // Lower non-const v4i8 vector as byte-wise constructed i32, which allows us
     // to optimize calculation of constant parts.
     if (VT == MVT::v4i8) {
-      SDValue C8 = DAG.getConstant(8, DL, MVT::i32);
-      SDValue E01 = DAG.getNode(
-          NVPTXISD::BFI, DL, MVT::i32,
-          DAG.getAnyExtOrTrunc(Op->getOperand(1), DL, MVT::i32),
-          DAG.getAnyExtOrTrunc(Op->getOperand(0), DL, MVT::i32), C8, C8);
-      SDValue E012 =
-          DAG.getNode(NVPTXISD::BFI, DL, MVT::i32,
-                      DAG.getAnyExtOrTrunc(Op->getOperand(2), DL, MVT::i32),
-                      E01, DAG.getConstant(16, DL, MVT::i32), C8);
-      SDValue E0123 =
-          DAG.getNode(NVPTXISD::BFI, DL, MVT::i32,
-                      DAG.getAnyExtOrTrunc(Op->getOperand(3), DL, MVT::i32),
-                      E012, DAG.getConstant(24, DL, MVT::i32), C8);
-      return DAG.getNode(ISD::BITCAST, DL, VT, E0123);
+      SDValue PRMT__10 = DAG.getNode(
+          NVPTXISD::PRMT, DL, MVT::v4i8,
+          {DAG.getAnyExtOrTrunc(Op->getOperand(0), DL, MVT::i32),
+           DAG.getAnyExtOrTrunc(Op->getOperand(1), DL, MVT::i32),
+           DAG.getConstant(0x3340, DL, MVT::i32),
+           DAG.getConstant(NVPTX::PTXPrmtMode::NONE, DL, MVT::i32)});
+      SDValue PRMT_210 = DAG.getNode(
+          NVPTXISD::PRMT, DL, MVT::v4i8,
+          {PRMT__10, DAG.getAnyExtOrTrunc(Op->getOperand(2), DL, MVT::i32),
+           DAG.getConstant(0x3410, DL, MVT::i32),
+           DAG.getConstant(NVPTX::PTXPrmtMode::NONE, DL, MVT::i32)});
+      SDValue PRMT3210 = DAG.getNode(
+          NVPTXISD::PRMT, DL, MVT::v4i8,
+          {PRMT_210, DAG.getAnyExtOrTrunc(Op->getOperand(3), DL, MVT::i32),
+           DAG.getConstant(0x4210, DL, MVT::i32),
+           DAG.getConstant(NVPTX::PTXPrmtMode::NONE, DL, MVT::i32)});
+      return DAG.getNode(ISD::BITCAST, DL, VT, PRMT3210);
     }
     return Op;
   }
diff --git a/llvm/test/CodeGen/NVPTX/i8x4-instructions.ll b/llvm/test/CodeGen/NVPTX/i8x4-instructions.ll
index 96a4359d0ec43e..fdc25cf95d06af 100644
--- a/llvm/test/CodeGen/NVPTX/i8x4-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/i8x4-instructions.ll
@@ -101,38 +101,38 @@ define <4 x i8> @test_add(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-LABEL: test_add(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b16 %rs<13>;
-; CHECK-NEXT:    .reg .b32 %r<19>;
+; CHECK-NEXT:    .reg .b32 %r<18>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r2, [test_add_param_1];
 ; CHECK-NEXT:    ld.param.u32 %r1, [test_add_param_0];
-; CHECK-NEXT:    bfe.u32 %r3, %r2, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r3, %r2, 8, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs1, %r3;
-; CHECK-NEXT:    bfe.u32 %r4, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r4, %r1, 8, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs2, %r4;
 ; CHECK-NEXT:    add.s16 %rs3, %rs2, %rs1;
 ; CHECK-NEXT:    cvt.u32.u16 %r5, %rs3;
-; CHECK-NEXT:    bfe.u32 %r6, %r2, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r6, %r2, 0, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs4, %r6;
-; CHECK-NEXT:    bfe.u32 %r7, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r7, %r1, 0, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs5, %r7;
 ; CHECK-NEXT:    add.s16 %rs6, %rs5, %rs4;
 ; CHECK-NEXT:    cvt.u32.u16 %r8, %rs6;
-; CHECK-NEXT:    bfi.b32 %r9, %r8, %r5, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r9, %r8, %r5, 13120;
 ; CHECK-NEXT:    bfe.u32 %r10, %r2, 16, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs7, %r10;
 ; CHECK-NEXT:    bfe.u32 %r11, %r1, 16, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs8, %r11;
 ; CHECK-NEXT:    add.s16 %rs9, %rs8, %rs7;
 ; CHECK-NEXT:    cvt.u32.u16 %r12, %rs9;
-; CHECK-NEXT:    bfi.b32 %r13, %r12, %r9, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r13, %r9, %r12, 13328;
 ; CHECK-NEXT:    bfe.u32 %r14, %r2, 24, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs10, %r14;
 ; CHECK-NEXT:    bfe.u32 %r15, %r1, 24, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs11, %r15;
 ; CHECK-NEXT:    add.s16 %rs12, %rs11, %rs10;
 ; CHECK-NEXT:    cvt.u32.u16 %r16, %rs12;
-; CHECK-NEXT:    bfi.b32 %r17, %r16, %r13, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r17, %r13, %r16, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r17;
 ; CHECK-NEXT:    ret;
   %r = add <4 x i8> %a, %b
@@ -143,29 +143,29 @@ define <4 x i8> @test_add_imm_0(<4 x i8> %a) #0 {
 ; CHECK-LABEL: test_add_imm_0(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b16 %rs<9>;
-; CHECK-NEXT:    .reg .b32 %r<14>;
+; CHECK-NEXT:    .reg .b32 %r<13>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r1, [test_add_imm_0_param_0];
-; CHECK-NEXT:    bfe.u32 %r2, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r2, %r1, 8, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs1, %r2;
-; CHECK-NEXT:    add.s16 %rs2, %rs1, 1;
+; CHECK-NEXT:    add.s16 %rs2, %rs1, 2;
 ; CHECK-NEXT:    cvt.u32.u16 %r3, %rs2;
-; CHECK-NEXT:    bfe.u32 %r4, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r4, %r1, 0, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs3, %r4;
-; CHECK-NEXT:    add.s16 %rs4, %rs3, 2;
+; CHECK-NEXT:    add.s16 %rs4, %rs3, 1;
 ; CHECK-NEXT:    cvt.u32.u16 %r5, %rs4;
-; CHECK-NEXT:    bfi.b32 %r6, %r5, %r3, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r6, %r5, %r3, 13120;
 ; CHECK-NEXT:    bfe.u32 %r7, %r1, 16, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs5, %r7;
 ; CHECK-NEXT:    add.s16 %rs6, %rs5, 3;
 ; CHECK-NEXT:    cvt.u32.u16 %r8, %rs6;
-; CHECK-NEXT:    bfi.b32 %r9, %r8, %r6, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r9, %r6, %r8, 13328;
 ; CHECK-NEXT:    bfe.u32 %r10, %r1, 24, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs7, %r10;
 ; CHECK-NEXT:    add.s16 %rs8, %rs7, 4;
 ; CHECK-NEXT:    cvt.u32.u16 %r11, %rs8;
-; CHECK-NEXT:    bfi.b32 %r12, %r11, %r9, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r12, %r9, %r11, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r12;
 ; CHECK-NEXT:    ret;
   %r = add <4 x i8> <i8 1, i8 2, i8 3, i8 4>, %a
@@ -176,29 +176,29 @@ define <4 x i8> @test_add_imm_1(<4 x i8> %a) #0 {
 ; CHECK-LABEL: test_add_imm_1(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b16 %rs<9>;
-; CHECK-NEXT:    .reg .b32 %r<14>;
+; CHECK-NEXT:    .reg .b32 %r<13>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r1, [test_add_imm_1_param_0];
-; CHECK-NEXT:    bfe.u32 %r2, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r2, %r1, 8, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs1, %r2;
-; CHECK-NEXT:    add.s16 %rs2, %rs1, 1;
+; CHECK-NEXT:    add.s16 %rs2, %rs1, 2;
 ; CHECK-NEXT:    cvt.u32.u16 %r3, %rs2;
-; CHECK-NEXT:    bfe.u32 %r4, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r4, %r1, 0, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs3, %r4;
-; CHECK-NEXT:    add.s16 %rs4, %rs3, 2;
+; CHECK-NEXT:    add.s16 %rs4, %rs3, 1;
 ; CHECK-NEXT:    cvt.u32.u16 %r5, %rs4;
-; CHECK-NEXT:    bfi.b32 %r6, %r5, %r3, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r6, %r5, %r3, 13120;
 ; CHECK-NEXT:    bfe.u32 %r7, %r1, 16, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs5, %r7;
 ; CHECK-NEXT:    add.s16 %rs6, %rs5, 3;
 ; CHECK-NEXT:    cvt.u32.u16 %r8, %rs6;
-; CHECK-NEXT:    bfi.b32 %r9, %r8, %r6, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r9, %r6, %r8, 13328;
 ; CHECK-NEXT:    bfe.u32 %r10, %r1, 24, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs7, %r10;
 ; CHECK-NEXT:    add.s16 %rs8, %rs7, 4;
 ; CHECK-NEXT:    cvt.u32.u16 %r11, %rs8;
-; CHECK-NEXT:    bfi.b32 %r12, %r11, %r9, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r12, %r9, %r11, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r12;
 ; CHECK-NEXT:    ret;
   %r = add <4 x i8> %a, <i8 1, i8 2, i8 3, i8 4>
@@ -209,38 +209,38 @@ define <4 x i8> @test_sub(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-LABEL: test_sub(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b16 %rs<13>;
-; CHECK-NEXT:    .reg .b32 %r<19>;
+; CHECK-NEXT:    .reg .b32 %r<18>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r2, [test_sub_param_1];
 ; CHECK-NEXT:    ld.param.u32 %r1, [test_sub_param_0];
-; CHECK-NEXT:    bfe.u32 %r3, %r2, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r3, %r2, 8, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs1, %r3;
-; CHECK-NEXT:    bfe.u32 %r4, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r4, %r1, 8, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs2, %r4;
 ; CHECK-NEXT:    sub.s16 %rs3, %rs2, %rs1;
 ; CHECK-NEXT:    cvt.u32.u16 %r5, %rs3;
-; CHECK-NEXT:    bfe.u32 %r6, %r2, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r6, %r2, 0, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs4, %r6;
-; CHECK-NEXT:    bfe.u32 %r7, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r7, %r1, 0, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs5, %r7;
 ; CHECK-NEXT:    sub.s16 %rs6, %rs5, %rs4;
 ; CHECK-NEXT:    cvt.u32.u16 %r8, %rs6;
-; CHECK-NEXT:    bfi.b32 %r9, %r8, %r5, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r9, %r8, %r5, 13120;
 ; CHECK-NEXT:    bfe.u32 %r10, %r2, 16, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs7, %r10;
 ; CHECK-NEXT:    bfe.u32 %r11, %r1, 16, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs8, %r11;
 ; CHECK-NEXT:    sub.s16 %rs9, %rs8, %rs7;
 ; CHECK-NEXT:    cvt.u32.u16 %r12, %rs9;
-; CHECK-NEXT:    bfi.b32 %r13, %r12, %r9, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r13, %r9, %r12, 13328;
 ; CHECK-NEXT:    bfe.u32 %r14, %r2, 24, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs10, %r14;
 ; CHECK-NEXT:    bfe.u32 %r15, %r1, 24, 8;
 ; CHECK-NEXT:    cvt.u16.u32 %rs11, %r15;
 ; CHECK-NEXT:    sub.s16 %rs12, %rs11, %rs10;
 ; CHECK-NEXT:    cvt.u32.u16 %r16, %rs12;
-; CHECK-NEXT:    bfi.b32 %r17, %r16, %r13, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r17, %r13, %r16, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r17;
 ; CHECK-NEXT:    ret;
   %r = sub <4 x i8> %a, %b
@@ -251,7 +251,7 @@ define <4 x i8> @test_smax(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-LABEL: test_smax(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .pred %p<5>;
-; CHECK-NEXT:    .reg .b32 %r<27>;
+; CHECK-NEXT:    .reg .b32 %r<26>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r2, [test_smax_param_1];
@@ -262,27 +262,27 @@ define <4 x i8> @test_smax(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-NEXT:    bfe.s32 %r5, %r2, 16, 8;
 ; CHECK-NEXT:    bfe.s32 %r6, %r1, 16, 8;
 ; CHECK-NEXT:    setp.gt.s32 %p2, %r6, %r5;
-; CHECK-NEXT:    bfe.s32 %r7, %r2, 8, 8;
-; CHECK-NEXT:    bfe.s32 %r8, %r1, 8, 8;
+; CHECK-NEXT:    bfe.s32 %r7, %r2, 0, 8;
+; CHECK-NEXT:    bfe.s32 %r8, %r1, 0, 8;
 ; CHECK-NEXT:    setp.gt.s32 %p3, %r8, %r7;
-; CHECK-NEXT:    bfe.s32 %r9, %r2, 0, 8;
-; CHECK-NEXT:    bfe.s32 %r10, %r1, 0, 8;
+; CHECK-NEXT:    bfe.s32 %r9, %r2, 8, 8;
+; CHECK-NEXT:    bfe.s32 %r10, %r1, 8, 8;
 ; CHECK-NEXT:    setp.gt.s32 %p4, %r10, %r9;
 ; CHECK-NEXT:    bfe.u32 %r11, %r1, 24, 8;
 ; CHECK-NEXT:    bfe.u32 %r12, %r1, 16, 8;
-; CHECK-NEXT:    bfe.u32 %r13, %r1, 8, 8;
-; CHECK-NEXT:    bfe.u32 %r14, %r1, 0, 8;
-; CHECK-NEXT:    bfe.u32 %r15, %r2, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r13, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r14, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r15, %r2, 8, 8;
 ; CHECK-NEXT:    selp.b32 %r16, %r14, %r15, %p4;
-; CHECK-NEXT:    bfe.u32 %r17, %r2, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r17, %r2, 0, 8;
 ; CHECK-NEXT:    selp.b32 %r18, %r13, %r17, %p3;
-; CHECK-NEXT:    bfi.b32 %r19, %r18, %r16, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r19, %r18, %r16, 13120;
 ; CHECK-NEXT:    bfe.u32 %r20, %r2, 16, 8;
 ; CHECK-NEXT:    selp.b32 %r21, %r12, %r20, %p2;
-; CHECK-NEXT:    bfi.b32 %r22, %r21, %r19, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r22, %r19, %r21, 13328;
 ; CHECK-NEXT:    bfe.u32 %r23, %r2, 24, 8;
 ; CHECK-NEXT:    selp.b32 %r24, %r11, %r23, %p1;
-; CHECK-NEXT:    bfi.b32 %r25, %r24, %r22, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r25, %r22, %r24, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r25;
 ; CHECK-NEXT:    ret;
   %cmp = icmp sgt <4 x i8> %a, %b
@@ -294,7 +294,7 @@ define <4 x i8> @test_umax(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-LABEL: test_umax(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .pred %p<5>;
-; CHECK-NEXT:    .reg .b32 %r<19>;
+; CHECK-NEXT:    .reg .b32 %r<18>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r2, [test_umax_param_1];
@@ -305,19 +305,19 @@ define <4 x i8> @test_umax(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-NEXT:    bfe.u32 %r5, %r2, 16, 8;
 ; CHECK-NEXT:    bfe.u32 %r6, %r1, 16, 8;
 ; CHECK-NEXT:    setp.hi.u32 %p2, %r6, %r5;
-; CHECK-NEXT:    bfe.u32 %r7, %r2, 8, 8;
-; CHECK-NEXT:    bfe.u32 %r8, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r7, %r2, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r8, %r1, 0, 8;
 ; CHECK-NEXT:    setp.hi.u32 %p3, %r8, %r7;
-; CHECK-NEXT:    bfe.u32 %r9, %r2, 0, 8;
-; CHECK-NEXT:    bfe.u32 %r10, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r9, %r2, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r10, %r1, 8, 8;
 ; CHECK-NEXT:    setp.hi.u32 %p4, %r10, %r9;
 ; CHECK-NEXT:    selp.b32 %r11, %r10, %r9, %p4;
 ; CHECK-NEXT:    selp.b32 %r12, %r8, %r7, %p3;
-; CHECK-NEXT:    bfi.b32 %r13, %r12, %r11, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r13, %r12, %r11, 13120;
 ; CHECK-NEXT:    selp.b32 %r14, %r6, %r5, %p2;
-; CHECK-NEXT:    bfi.b32 %r15, %r14, %r13, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r15, %r13, %r14, 13328;
 ; CHECK-NEXT:    selp.b32 %r16, %r4, %r3, %p1;
-; CHECK-NEXT:    bfi.b32 %r17, %r16, %r15, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r17, %r15, %r16, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r17;
 ; CHECK-NEXT:    ret;
   %cmp = icmp ugt <4 x i8> %a, %b
@@ -329,7 +329,7 @@ define <4 x i8> @test_smin(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-LABEL: test_smin(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .pred %p<5>;
-; CHECK-NEXT:    .reg .b32 %r<27>;
+; CHECK-NEXT:    .reg .b32 %r<26>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r2, [test_smin_param_1];
@@ -340,27 +340,27 @@ define <4 x i8> @test_smin(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-NEXT:    bfe.s32 %r5, %r2, 16, 8;
 ; CHECK-NEXT:    bfe.s32 %r6, %r1, 16, 8;
 ; CHECK-NEXT:    setp.le.s32 %p2, %r6, %r5;
-; CHECK-NEXT:    bfe.s32 %r7, %r2, 8, 8;
-; CHECK-NEXT:    bfe.s32 %r8, %r1, 8, 8;
+; CHECK-NEXT:    bfe.s32 %r7, %r2, 0, 8;
+; CHECK-NEXT:    bfe.s32 %r8, %r1, 0, 8;
 ; CHECK-NEXT:    setp.le.s32 %p3, %r8, %r7;
-; CHECK-NEXT:    bfe.s32 %r9, %r2, 0, 8;
-; CHECK-NEXT:    bfe.s32 %r10, %r1, 0, 8;
+; CHECK-NEXT:    bfe.s32 %r9, %r2, 8, 8;
+; CHECK-NEXT:    bfe.s32 %r10, %r1, 8, 8;
 ; CHECK-NEXT:    setp.le.s32 %p4, %r10, %r9;
 ; CHECK-NEXT:    bfe.u32 %r11, %r1, 24, 8;
 ; CHECK-NEXT:    bfe.u32 %r12, %r1, 16, 8;
-; CHECK-NEXT:    bfe.u32 %r13, %r1, 8, 8;
-; CHECK-NEXT:    bfe.u32 %r14, %r1, 0, 8;
-; CHECK-NEXT:    bfe.u32 %r15, %r2, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r13, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r14, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r15, %r2, 8, 8;
 ; CHECK-NEXT:    selp.b32 %r16, %r14, %r15, %p4;
-; CHECK-NEXT:    bfe.u32 %r17, %r2, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r17, %r2, 0, 8;
 ; CHECK-NEXT:    selp.b32 %r18, %r13, %r17, %p3;
-; CHECK-NEXT:    bfi.b32 %r19, %r18, %r16, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r19, %r18, %r16, 13120;
 ; CHECK-NEXT:    bfe.u32 %r20, %r2, 16, 8;
 ; CHECK-NEXT:    selp.b32 %r21, %r12, %r20, %p2;
-; CHECK-NEXT:    bfi.b32 %r22, %r21, %r19, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r22, %r19, %r21, 13328;
 ; CHECK-NEXT:    bfe.u32 %r23, %r2, 24, 8;
 ; CHECK-NEXT:    selp.b32 %r24, %r11, %r23, %p1;
-; CHECK-NEXT:    bfi.b32 %r25, %r24, %r22, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r25, %r22, %r24, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r25;
 ; CHECK-NEXT:    ret;
   %cmp = icmp sle <4 x i8> %a, %b
@@ -372,7 +372,7 @@ define <4 x i8> @test_umin(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-LABEL: test_umin(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .pred %p<5>;
-; CHECK-NEXT:    .reg .b32 %r<19>;
+; CHECK-NEXT:    .reg .b32 %r<18>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r2, [test_umin_param_1];
@@ -383,19 +383,19 @@ define <4 x i8> @test_umin(<4 x i8> %a, <4 x i8> %b) #0 {
 ; CHECK-NEXT:    bfe.u32 %r5, %r2, 16, 8;
 ; CHECK-NEXT:    bfe.u32 %r6, %r1, 16, 8;
 ; CHECK-NEXT:    setp.ls.u32 %p2, %r6, %r5;
-; CHECK-NEXT:    bfe.u32 %r7, %r2, 8, 8;
-; CHECK-NEXT:    bfe.u32 %r8, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r7, %r2, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r8, %r1, 0, 8;
 ; CHECK-NEXT:    setp.ls.u32 %p3, %r8, %r7;
-; CHECK-NEXT:    bfe.u32 %r9, %r2, 0, 8;
-; CHECK-NEXT:    bfe.u32 %r10, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r9, %r2, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r10, %r1, 8, 8;
 ; CHECK-NEXT:    setp.ls.u32 %p4, %r10, %r9;
 ; CHECK-NEXT:    selp.b32 %r11, %r10, %r9, %p4;
 ; CHECK-NEXT:    selp.b32 %r12, %r8, %r7, %p3;
-; CHECK-NEXT:    bfi.b32 %r13, %r12, %r11, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r13, %r12, %r11, 13120;
 ; CHECK-NEXT:    selp.b32 %r14, %r6, %r5, %p2;
-; CHECK-NEXT:    bfi.b32 %r15, %r14, %r13, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r15, %r13, %r14, 13328;
 ; CHECK-NEXT:    selp.b32 %r16, %r4, %r3, %p1;
-; CHECK-NEXT:    bfi.b32 %r17, %r16, %r15, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r17, %r15, %r16, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r17;
 ; CHECK-NEXT:    ret;
   %cmp = icmp ule <4 x i8> %a, %b
@@ -407,7 +407,7 @@ define <4 x i8> @test_eq(<4 x i8> %a, <4 x i8> %b, <4 x i8> %c) #0 {
 ; CHECK-LABEL: test_eq(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .pred %p<5>;
-; CHECK-NEXT:    .reg .b32 %r<24>;
+; CHECK-NEXT:    .reg .b32 %r<23>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r3, [test_eq_param_2];
@@ -419,23 +419,23 @@ define <4 x i8> @test_eq(<4 x i8> %a, <4 x i8> %b, <4 x i8> %c) #0 {
 ; CHECK-NEXT:    bfe.u32 %r6, %r2, 16, 8;
 ; CHECK-NEXT:    bfe.u32 %r7, %r1, 16, 8;
 ; CHECK-NEXT:    setp.eq.u32 %p2, %r7, %r6;
-; CHECK-NEXT:    bfe.u32 %r8, %r2, 8, 8;
-; CHECK-NEXT:    bfe.u32 %r9, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r8, %r2, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r9, %r1, 0, 8;
 ; CHECK-NEXT:    setp.eq.u32 %p3, %r9, %r8;
-; CHECK-NEXT:    bfe.u32 %r10, %r2, 0, 8;
-; CHECK-NEXT:    bfe.u32 %r11, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r10, %r2, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r11, %r1, 8, 8;
 ; CHECK-NEXT:    setp.eq.u32 %p4, %r11, %r10;
-; CHECK-NEXT:    bfe.u32 %r12, %r3, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r12, %r3, 8, 8;
 ; CHECK-NEXT:    selp.b32 %r13, %r11, %r12, %p4;
-; CHECK-NEXT:    bfe.u32 %r14, %r3, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r14, %r3, 0, 8;
 ; CHECK-NEXT:    selp.b32 %r15, %r9, %r14, %p3;
-; CHECK-NEXT:    bfi.b32 %r16, %r15, %r13, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r16, %r15, %r13, 13120;
 ; CHECK-NEXT:    bfe.u32 %r17, %r3, 16, 8;
 ; CHECK-NEXT:    selp.b32 %r18, %r7, %r17, %p2;
-; CHECK-NEXT:    bfi.b32 %r19, %r18, %r16, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r19, %r16, %r18, 13328;
 ; CHECK-NEXT:    bfe.u32 %r20, %r3, 24, 8;
 ; CHECK-NEXT:    selp.b32 %r21, %r5, %r20, %p1;
-; CHECK-NEXT:    bfi.b32 %r22, %r21, %r19, 24, 8;
+; CHECK-NEXT:    prmt.b32 %r22, %r19, %r21, 16912;
 ; CHECK-NEXT:    st.param.b32 [func_retval0+0], %r22;
 ; CHECK-NEXT:    ret;
   %cmp = icmp eq <4 x i8> %a, %b
@@ -447,7 +447,7 @@ define <4 x i8> @test_ne(<4 x i8> %a, <4 x i8> %b, <4 x i8> %c) #0 {
 ; CHECK-LABEL: test_ne(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .pred %p<5>;
-; CHECK-NEXT:    .reg .b32 %r<24>;
+; CHECK-NEXT:    .reg .b32 %r<23>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r3, [test_ne_param_2];
@@ -459,23 +459,23 @@ define <4 x i8> @test_ne(<4 x i8> %a, <4 x i8> %b, <4 x i8> %c) #0 {
 ; CHECK-NEXT:    bfe.u32 %r6, %r2, 16, 8;
 ; CHECK-NEXT:    bfe.u32 %r7, %r1, 16, 8;
 ; CHECK-NEXT:    setp.ne.u32 %p2, %r7, %r6;
-; CHECK-NEXT:    bfe.u32 %r8, %r2, 8, 8;
-; CHECK-NEXT:    bfe.u32 %r9, %r1, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r8, %r2, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r9, %r1, 0, 8;
 ; CHECK-NEXT:    setp.ne.u32 %p3, %r9, %r8;
-; CHECK-NEXT:    bfe.u32 %r10, %r2, 0, 8;
-; CHECK-NEXT:    bfe.u32 %r11, %r1, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r10, %r2, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r11, %r1, 8, 8;
 ; CHECK-NEXT:    setp.ne.u32 %p4, %r11, %r10;
-; CHECK-NEXT:    bfe.u32 %r12, %r3, 0, 8;
+; CHECK-NEXT:    bfe.u32 %r12, %r3, 8, 8;
 ; CHECK-NEXT:    selp.b32 %r13, %r11, %r12, %p4;
-; CHECK-NEXT:    bfe.u32 %r14, %r3, 8, 8;
+; CHECK-NEXT:    bfe.u32 %r14, %r3, 0, 8;
 ; CHECK-NEXT:    selp.b32 %r15, %r9, %r14, %p3;
-; CHECK-NEXT:    bfi.b32 %r16, %r15, %r13, 8, 8;
+; CHECK-NEXT:    prmt.b32 %r16, %r15, %r13, 13120;
 ; CHECK-NEXT:    bfe.u32 %r17, %r3, 16, 8;
 ; CHECK-NEXT:    selp.b32 %r18, %r7, %r17, %p2;
-; CHECK-NEXT:    bfi.b32 %r19, %r18, %r16, 16, 8;
+; CHECK-NEXT:    prmt.b32 %r19, %r16, %r18, 13328;
 ; CHECK-NEXT:    bfe.u32 %...
[truncated]

Artem-B · 2024-10-02T18:45:32Z

Indeed. According to https://arxiv.org/pdf/2208.11174 BFI is much more expensive than PRMT which appears to take just 1 cycle on A100:

Artem-B · 2024-10-02T18:47:03Z

@justinfargnoli -- any chance NVIDIA could publish more details on the instruction timings. It's kind of hard to tell how fast something is expected to be by looking at the mnemonics alone.

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

justinfargnoli · 2024-10-04T19:53:56Z

According to https://arxiv.org/pdf/2208.11174 BFI is much more expensive than PRMT which appears to take just 1 cycle on A100:

I think this refers to bfi in the general case. When the c and d operands are multiples of 8, bfi can be run as a prmt.

github-actions · 2024-10-04T19:55:13Z

✅ With the latest revision this PR passed the C/C++ code formatter.

Artem-B

Please wait till Monday to land it, just in case.

Artem-B · 2024-10-04T23:24:06Z

According to https://arxiv.org/pdf/2208.11174 BFI is much more expensive than PRMT which appears to take just 1 cycle on A100:

I think this refers to bfi in the general case. When the c and d operands are multiples of 8, bfi can be run as a prmt.

Ptxas actully does compile such bfi into prmt: https://godbolt.org/z/9xTrhG99v
This implies that this patch is likely a no-op for that actual GPU code on the SASS level.

Do we still want or need it?

kalxr · 2024-10-09T21:21:57Z

Converting from v = prmt(d, prmt(c, prmt(a,b))) to v = prmt(prmt(c,d), prmt(a,b)) may squeeze a bit more performance here if GPU can do two leaf permutes in parallel as they are independent.

I agree that this would be better for performance, and it doesn't look like ptxas does this.

Do we still want or need it?

Given that this change would include the above improvement, IMO it would be useful.

justinfargnoli · 2024-10-09T21:36:36Z

This implies that this patch is likely a no-op for that actual GPU code on the SASS level.
Do we still want or need it?

CC @AlexMaclean, as he's voiced similar concerns offline.

The original goal of this PR was to implement what I thought you had wanted to do in [NVPTX] Improve lowering of v4i8 based on your comment. However, as @kalxr has pointed out, this PR also includes the prmt(d, prmt(c, prmt(a,b))) --> prmt(prmt(c,d), prmt(a,b)) transformation that makes this PR more performant than the current approach.

Additionally, I plan to work on further optimizations for prmt, and this change would enable those future improvements to be leveraged.

Artem-B · 2024-10-10T01:03:58Z

I actually like ptmt better than bfi, as it also allows us to handle sign extension. We could use it as a base or example for cleaning up other i8/i16/i32 conversions later.

I think you're good to go.

Artem-B · 2024-10-10T17:37:29Z

A logical follow-up to this patch would be to replace bfe -> prmt, too for byte extraction.

We should also take a look at v4i8 <-> v4i16 conversions. If we're not using prmt there, we should.

@Artem-B

In [[NVPTX] Improve lowering of v4i8](llvm@cbafb6f) @Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX instructions. @Artem-B did this because: ([source](llvm#67866 (comment))) > Under the hood byte extraction/insertion ends up as BFI/BFE instructions, so we may as well do that in PTX, too. https://godbolt.org/z/Tb3zWbj9b However, the example that @Artem-B linked was targeting sm_52. On modern architectures, ptxas uses prmt.b32. [Example](https://godbolt.org/z/Ye4W1n84o). Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.

akuegel · 2024-10-11T11:37:07Z

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

+          NVPTXISD::PRMT, DL, MVT::v4i8,
+          {DAG.getAnyExtOrTrunc(Op->getOperand(2), DL, MVT::i32),
+           DAG.getAnyExtOrTrunc(Op->getOperand(3), DL, MVT::i32),
+           DAG.getConstant(0x4033, DL, MVT::i32),


Looks like the mode here needs to be 0x3340, then our tests pass.

@justinfargnoli for visibility

cota · 2024-10-11T12:16:28Z

Thanks @akuegel for pointing that out. I will revert this now since that confirms that this commit is the root cause of the failures we are seeing in our downstream tests (gross numerical disparities), despite no upstream tests being broken.

This reverts commit 3f9998a. It breaks downstream tests with egregious numerical differences. Unfortunately no upstream tests are broken, but the fact that a prior iteration of the commit (pre-optimization) does work with our downstream tests (coming from the Triton repo) supports the claim that the final version of the commit is incorrect. Reverting now so that the original author can evaluate.

…ent-indentonly * llvm-trunk/main: (6379 commits) [gn build] Port 1c94388 [RISCV] Introduce VLOptimizer pass (llvm#108640) [mlir][vector] Add more tests for ConvertVectorToLLVM (7/n) (llvm#111895) [libc++] Add output groups to run-buildbot (llvm#111739) [libc++abi] Remove unused LIBCXXABI_LIBCXX_INCLUDES CMake option (llvm#111824) [clang] Ignore inline namespace for `hasName` (llvm#109147) [AArch64] Disable consecutive store merging when Neon is unavailable (llvm#111519) [lldb] Fix finding make tool for tests (llvm#111980) Turn `-Wdeprecated-literal-operator` on by default (llvm#111027) [AMDGPU] Rewrite RegSeqNames using !foreach. NFC. (llvm#111994) Revert "Reland: [clang] Finish implementation of P0522 (llvm#111711)" Revert "[clang] CWG2398: improve overload resolution backwards compat (llvm#107350)" Revert "[clang] Implement TTP P0522 pack matching for deduced function template calls. (llvm#111457)" [Clang] Replace Intrinsic::getDeclaration with getOrInsertDeclaration (llvm#111990) Revert "[NVPTX] Prefer prmt.b32 over bfi.b32 (llvm#110766)" [RISCV] Add DAG combine to turn (sub (shl X, 8-Y), (shr X, Y)) into orc.b (llvm#111828) [libc] Fix compilation of new trig functions (llvm#111987) [NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (llvm#111752) [NFC][CodingStandard] Add additional example for if-else brace rule (llvm#111733) CodeGen: Remove redundant REQUIRES registered-target from tests (llvm#111982) ...

@Artem-B

In [[NVPTX] Improve lowering of v4i8](llvm@cbafb6f) @Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX instructions. @Artem-B did this because: ([source](llvm#67866 (comment))) > Under the hood byte extraction/insertion ends up as BFI/BFE instructions, so we may as well do that in PTX, too. https://godbolt.org/z/Tb3zWbj9b However, the example that @Artem-B linked was targeting sm_52. On modern architectures, ptxas uses prmt.b32. [Example](https://godbolt.org/z/Ye4W1n84o). Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.

This reverts commit 3f9998a. It breaks downstream tests with egregious numerical differences. Unfortunately no upstream tests are broken, but the fact that a prior iteration of the commit (pre-optimization) does work with our downstream tests (coming from the Triton repo) supports the claim that the final version of the commit is incorrect. Reverting now so that the original author can evaluate.

@akuegel

Fix [failure](#110766 (comment)) identified by @akuegel. --- In [[NVPTX] Improve lowering of v4i8](cbafb6f) @Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX instructions. @Artem-B did this because: (#67866 (comment)) Under the hood byte extraction/insertion ends up as BFI/BFE instructions, so we may as well do that in PTX, too. https://godbolt.org/z/Tb3zWbj9b However, the example that @Artem-B linked was targeting sm_52. On modern architectures, ptxas uses prmt.b32. [Example](https://godbolt.org/z/Ye4W1n84o). Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.

@akuegel

Fix [failure](llvm#110766 (comment)) identified by @akuegel. --- In [[NVPTX] Improve lowering of v4i8](llvm@cbafb6f) @Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX instructions. @Artem-B did this because: (llvm#67866 (comment)) Under the hood byte extraction/insertion ends up as BFI/BFE instructions, so we may as well do that in PTX, too. https://godbolt.org/z/Tb3zWbj9b However, the example that @Artem-B linked was targeting sm_52. On modern architectures, ptxas uses prmt.b32. [Example](https://godbolt.org/z/Ye4W1n84o). Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.

@akuegel

Fix [failure](llvm#110766 (comment)) identified by @akuegel. --- In [[NVPTX] Improve lowering of v4i8](llvm@cbafb6f) @Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX instructions. @Artem-B did this because: (llvm#67866 (comment)) Under the hood byte extraction/insertion ends up as BFI/BFE instructions, so we may as well do that in PTX, too. https://godbolt.org/z/Tb3zWbj9b However, the example that @Artem-B linked was targeting sm_52. On modern architectures, ptxas uses prmt.b32. [Example](https://godbolt.org/z/Ye4W1n84o). Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.

justinfargnoli added 2 commits September 24, 2024 17:08

Initial commit

52a1db5

Update sext-setcc.ll

8b9c99f

justinfargnoli requested review from AlexMaclean, Artem-B and kalxr October 1, 2024 23:49

justinfargnoli self-assigned this Oct 1, 2024

llvmbot added the backend:NVPTX label Oct 1, 2024

Artem-B reviewed Oct 2, 2024

View reviewed changes

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp Show resolved Hide resolved

prmt(d, prmt(c, prmt(a,b))) --> prmt(prmt(c,d), prmt(a,b))

5b1853f

justinfargnoli added 2 commits October 4, 2024 12:56

clang-format

6c3197c

Update tests

eb105b4

Artem-B approved these changes Oct 4, 2024

View reviewed changes

kalxr approved these changes Oct 10, 2024

View reviewed changes

justinfargnoli merged commit 3f9998a into llvm:main Oct 10, 2024
9 checks passed

justinfargnoli deleted the dev/jf/upstream-bfi-to-prmt branch October 10, 2024 17:24

akuegel reviewed Oct 11, 2024

View reviewed changes

justinfargnoli mentioned this pull request Oct 30, 2024

Reland "[NVPTX] Prefer prmt.b32 over bfi.b32" #114326

Merged

[NVPTX] Prefer prmt.b32 over bfi.b32 #110766

[NVPTX] Prefer prmt.b32 over bfi.b32 #110766

Uh oh!

Conversation

justinfargnoli commented Oct 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Oct 1, 2024

Uh oh!

Artem-B commented Oct 2, 2024

Uh oh!

Artem-B commented Oct 2, 2024

Uh oh!

Uh oh!

justinfargnoli commented Oct 4, 2024

Uh oh!

github-actions bot commented Oct 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Artem-B left a comment

Choose a reason for hiding this comment

Uh oh!

Artem-B commented Oct 4, 2024

Uh oh!

kalxr commented Oct 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

justinfargnoli commented Oct 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Artem-B commented Oct 10, 2024

Uh oh!

Uh oh!

Artem-B commented Oct 10, 2024

Uh oh!

akuegel Oct 11, 2024

Choose a reason for hiding this comment

Uh oh!

akuegel Oct 15, 2024

Choose a reason for hiding this comment

Uh oh!

cota commented Oct 11, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

justinfargnoli commented Oct 1, 2024 •

edited

Loading

github-actions bot commented Oct 4, 2024 •

edited

Loading

kalxr commented Oct 9, 2024 •

edited

Loading

justinfargnoli commented Oct 9, 2024 •

edited

Loading