[VectorCombine] foldBitcastShuffle - peek through any residual bitcasts before creating a new bitcast on top #86119


Merged - RKSimon merged 1 commit into llvm:main from RKSimon:bitcast-shuffle on Apr 2, 2024

Conversation

@RKSimon (Collaborator) commented Mar 21, 2024

Encountered while working on #67803, wading through the chains of bitcasts that SSE intrinsics introduce - this patch helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?
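
For reference, a minimal IR sketch of the before/after, simplified from the PR35454_1 test in the diff below:

; input: a bitcast of a shuffle whose operand is itself a bitcast
%bc   = bitcast <2 x i64> %v to <4 x i32>
%shuf = shufflevector <4 x i32> %bc, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
%bc1  = bitcast <4 x i32> %shuf to <16 x i8>

; foldBitcastShuffle previously rebuilt the cast on top of %bc, leaving a chain:
%bc   = bitcast <2 x i64> %v to <4 x i32>
%tmp  = bitcast <4 x i32> %bc to <16 x i8>
%bc1  = shufflevector <16 x i8> %tmp, <16 x i8> poison, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>

; with the peek, the new cast starts from the original source:
%tmp  = bitcast <2 x i64> %v to <16 x i8>
%bc1  = shufflevector <16 x i8> %tmp, <16 x i8> poison, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>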

@llvmbot (Member) commented Mar 21, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Simon Pilgrim (RKSimon)


Full diff: https://github.com/llvm/llvm-project/pull/86119.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VectorCombine.cpp (+14-2)
  • (modified) llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll (+2-4)
  • (modified) llvm/test/Transforms/VectorCombine/X86/shuffle.ll (+2-4)
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index 7e86137f23f3c8..87e05dde545cd5 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -135,6 +135,18 @@ class VectorCombine {
 };
 } // namespace
 
+/// Return the source operand of a potentially bitcasted value while
+/// optionally checking if it has one use. If there is no bitcast or the one
+/// use check is not met, return the input value itself.
+static Value *peekThroughBitcast(Value *V, bool OneUseOnly = false) {
+  if (auto *BitCast = dyn_cast<BitCastInst>(V))
+    if (!OneUseOnly || BitCast->hasOneUse())
+      return BitCast->getOperand(0);
+
+  // V is not a bitcast or V has more than one use and OneUseOnly is true.
+  return V;
+}
+
 static bool canWidenLoad(LoadInst *Load, const TargetTransformInfo &TTI) {
   // Do not widen load if atomic/volatile or under asan/hwasan/memtag/tsan.
   // The widened load may load data from dirty regions or create data races
@@ -751,8 +763,8 @@ bool VectorCombine::foldBitcastShuffle(Instruction &I) {
 
   // bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'
   ++NumShufOfBitcast;
-  Value *CastV0 = Builder.CreateBitCast(V0, NewShuffleTy);
-  Value *CastV1 = Builder.CreateBitCast(V1, NewShuffleTy);
+  Value *CastV0 = Builder.CreateBitCast(peekThroughBitcast(V0), NewShuffleTy);
+  Value *CastV1 = Builder.CreateBitCast(peekThroughBitcast(V1), NewShuffleTy);
   Value *Shuf = Builder.CreateShuffleVector(CastV0, CastV1, NewMask);
   replaceValue(I, *Shuf);
   return true;
diff --git a/llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll b/llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll
index 8c5c6656ca1795..74a58c8d313611 100644
--- a/llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll
@@ -133,8 +133,7 @@ define <2 x i64> @PR35454_1(<2 x i64> %v) {
 ; SSE-NEXT:    ret <2 x i64> [[BC3]]
 ;
 ; AVX-LABEL: @PR35454_1(
-; AVX-NEXT:    [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; AVX-NEXT:    [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <16 x i8>
+; AVX-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[V:%.*]] to <16 x i8>
 ; AVX-NEXT:    [[BC1:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>
 ; AVX-NEXT:    [[ADD:%.*]] = shl <16 x i8> [[BC1]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
 ; AVX-NEXT:    [[BC2:%.*]] = bitcast <16 x i8> [[ADD]] to <4 x i32>
@@ -164,8 +163,7 @@ define <2 x i64> @PR35454_2(<2 x i64> %v) {
 ; SSE-NEXT:    ret <2 x i64> [[BC3]]
 ;
 ; AVX-LABEL: @PR35454_2(
-; AVX-NEXT:    [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; AVX-NEXT:    [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <8 x i16>
+; AVX-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[V:%.*]] to <8 x i16>
 ; AVX-NEXT:    [[BC1:%.*]] = shufflevector <8 x i16> [[TMP1]], <8 x i16> poison, <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1>
 ; AVX-NEXT:    [[ADD:%.*]] = shl <8 x i16> [[BC1]], <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
 ; AVX-NEXT:    [[BC2:%.*]] = bitcast <8 x i16> [[ADD]] to <4 x i32>
diff --git a/llvm/test/Transforms/VectorCombine/X86/shuffle.ll b/llvm/test/Transforms/VectorCombine/X86/shuffle.ll
index 60cfc4d4b07079..d1484fd5ab3399 100644
--- a/llvm/test/Transforms/VectorCombine/X86/shuffle.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/shuffle.ll
@@ -133,8 +133,7 @@ define <2 x i64> @PR35454_1(<2 x i64> %v) {
 ; SSE-NEXT:    ret <2 x i64> [[BC3]]
 ;
 ; AVX-LABEL: @PR35454_1(
-; AVX-NEXT:    [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; AVX-NEXT:    [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <16 x i8>
+; AVX-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[V:%.*]] to <16 x i8>
 ; AVX-NEXT:    [[BC1:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>
 ; AVX-NEXT:    [[ADD:%.*]] = shl <16 x i8> [[BC1]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
 ; AVX-NEXT:    [[BC2:%.*]] = bitcast <16 x i8> [[ADD]] to <4 x i32>
@@ -164,8 +163,7 @@ define <2 x i64> @PR35454_2(<2 x i64> %v) {
 ; SSE-NEXT:    ret <2 x i64> [[BC3]]
 ;
 ; AVX-LABEL: @PR35454_2(
-; AVX-NEXT:    [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; AVX-NEXT:    [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <8 x i16>
+; AVX-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[V:%.*]] to <8 x i16>
 ; AVX-NEXT:    [[BC1:%.*]] = shufflevector <8 x i16> [[TMP1]], <8 x i16> poison, <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1>
 ; AVX-NEXT:    [[ADD:%.*]] = shl <8 x i16> [[BC1]], <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
 ; AVX-NEXT:    [[BC2:%.*]] = bitcast <8 x i16> [[ADD]] to <4 x i32>

@dtcxzyw (Member) commented Mar 21, 2024

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?

FYI, folding bitcast of bitcast is always safe. For reference, this is the cast-pair elimination table from CastInst::isEliminableCastPair in llvm/lib/IR/Instructions.cpp:

  // Define the 144 possibilities for these two cast instructions. The values
  // in this matrix determine what to do in a given situation and select the
  // case in the switch below.  The rows correspond to firstOp, the columns
  // correspond to secondOp.  In looking at the table below, keep in mind
  // the following cast properties:
  //
  //           Size Compare       Source               Destination
  // Operator  Src ? Size   Type       Sign         Type       Sign
  // -------- ------------ -------------------   ---------------------
  // TRUNC         >       Integer      Any        Integral     Any
  // ZEXT          <       Integral   Unsigned     Integer      Any
  // SEXT          <       Integral    Signed      Integer      Any
  // FPTOUI       n/a      FloatPt      n/a        Integral   Unsigned
  // FPTOSI       n/a      FloatPt      n/a        Integral    Signed
  // UITOFP       n/a      Integral   Unsigned     FloatPt      n/a
  // SITOFP       n/a      Integral    Signed      FloatPt      n/a
  // FPTRUNC       >       FloatPt      n/a        FloatPt      n/a
  // FPEXT         <       FloatPt      n/a        FloatPt      n/a
  // PTRTOINT     n/a      Pointer      n/a        Integral   Unsigned
  // INTTOPTR     n/a      Integral   Unsigned     Pointer      n/a
  // BITCAST       =       FirstClass   n/a        FirstClass   n/a
  // ADDRSPCST    n/a      Pointer      n/a        Pointer      n/a
  //
  // NOTE: some transforms are safe, but we consider them to be non-profitable.
  // For example, we could merge "fptoui double to i32" + "zext i32 to i64",
  // into "fptoui double to i64", but this loses information about the range
  // of the produced value (we no longer know the top-part is all zeros).
  // Further this conversion is often much more expensive for typical hardware,
  // and causes issues when building libgcc.  We disallow fptosi+sext for the
  // same reason.
  const unsigned numCastOps =
      Instruction::CastOpsEnd - Instruction::CastOpsBegin;
  static const uint8_t CastResults[numCastOps][numCastOps] = {
      // T        F  F  U  S  F  F  P  I  B  A  -+
      // R  Z  S  P  P  I  I  T  P  2  N  T  S   |
      // U  E  E  2  2  2  2  R  E  I  T  C  C   +- secondOp
      // N  X  X  U  S  F  F  N  X  N  2  V  V   |
      // C  T  T  I  I  P  P  C  T  T  P  T  T  -+
      {  1, 0, 0,99,99, 0, 0,99,99,99, 0, 3, 0}, // Trunc         -+
      {  8, 1, 9,99,99, 2,17,99,99,99, 2, 3, 0}, // ZExt           |
      {  8, 0, 1,99,99, 0, 2,99,99,99, 0, 3, 0}, // SExt           |
      {  0, 0, 0,99,99, 0, 0,99,99,99, 0, 3, 0}, // FPToUI         |
      {  0, 0, 0,99,99, 0, 0,99,99,99, 0, 3, 0}, // FPToSI         |
      { 99,99,99, 0, 0,99,99, 0, 0,99,99, 4, 0}, // UIToFP        +- firstOp
      { 99,99,99, 0, 0,99,99, 0, 0,99,99, 4, 0}, // SIToFP         |
      { 99,99,99, 0, 0,99,99, 0, 0,99,99, 4, 0}, // FPTrunc        |
      { 99,99,99, 2, 2,99,99, 8, 2,99,99, 4, 0}, // FPExt          |
      {  1, 0, 0,99,99, 0, 0,99,99,99, 7, 3, 0}, // PtrToInt       |
      { 99,99,99,99,99,99,99,99,99,11,99,15, 0}, // IntToPtr       |
      {  5, 5, 5, 0, 0, 5, 5, 0, 0,16, 5, 1,14}, // BitCast        |
      {  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,13,12}, // AddrSpaceCast -+
  };

  // TODO: This logic could be encoded into the table above and handled in the
  // switch below.
  // If either of the casts are a bitcast from scalar to vector, disallow the
  // merging. However, any pair of bitcasts are allowed.
  bool IsFirstBitcast  = (firstOp == Instruction::BitCast);
  bool IsSecondBitcast = (secondOp == Instruction::BitCast);
  bool AreBothBitcasts = IsFirstBitcast && IsSecondBitcast;

  // Check if any of the casts convert scalars <-> vectors.
  if ((IsFirstBitcast  && isa<VectorType>(SrcTy) != isa<VectorType>(MidTy)) ||
      (IsSecondBitcast && isa<VectorType>(MidTy) != isa<VectorType>(DstTy)))
    if (!AreBothBitcasts)
      return 0;
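
Reading the table: the BitCast row / BitCast column entry is 1 (in the switch that consumes this table, case 1 means the pair is replaced by a single cast with the first cast's opcode), and the scalar <-> vector guard above is explicitly skipped when both casts are bitcasts. So a pair of bitcasts always collapses; an illustrative IR sketch (not from the patch's tests):

; bitcast of bitcast folds unconditionally, even across scalar <-> vector:
%mid = bitcast <4 x i32> %v to <2 x i64>
%out = bitcast <2 x i64> %mid to i128
; ==> folds to:
%out = bitcast <4 x i32> %v to i128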

[VectorCombine] foldBitcastShuffle - peek through any residual bitcasts before creating a new bitcast on top

Encountered while working on llvm#67803, this helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?
@RKSimon (Collaborator, Author) commented Mar 22, 2024

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?

FYI, folding bitcast of bitcast is always safe.

It was mainly potential OneUse restrictions that concerned me. VectorCombine won't mind, but I can imagine there are cases where it would be a problem.
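
One concrete reading of that concern, as a hypothetical IR sketch (not from the patch's tests): if the inner bitcast has other users, peeking through it inside CreateBitCast cannot actually delete the chain, and a caller relying on use counts might not want the implicit peek:

%mid = bitcast <2 x i64> %v to <4 x i32>
%use = add <4 x i32> %mid, %mid               ; %mid has another user
%new = bitcast <4 x i32> %mid to <16 x i8>    ; peeking would give: bitcast <2 x i64> %v to <16 x i8>
; either way both bitcasts stay live, so nothing is removed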

@RKSimon (Collaborator, Author) commented Mar 25, 2024

Any further comments?

@nikic (Contributor) commented Mar 26, 2024

Encountered while working on #67803, wading through the chains of bitcasts that SSE intrinsics introduce - this patch helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.

Is that for further transforms in VectorCombine or somewhere else? Can we have a test that shows doing this optimization in VectorCombine has a benefit?

@RKSimon (Collaborator, Author) commented Mar 26, 2024

It's mainly a phase-ordering problem for #67803 (see https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/PhaseOrdering/X86/pr67803.ll):

; CHECK-NEXT:    [[TMP2:%.*]] = icmp sgt <8 x i32> [[TMP0]], [[TMP1]]
; CHECK-NEXT:    [[CMP_I21:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT:    [[SEXT_I22:%.*]] = sext <4 x i1> [[CMP_I21]] to <4 x i32>
; CHECK-NEXT:    [[CMP_I:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
; CHECK-NEXT:    [[SEXT_I:%.*]] = sext <4 x i1> [[CMP_I]] to <4 x i32>
; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <4 x i32> [[SEXT_I22]], <4 x i32> [[SEXT_I]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

The aim is to merge all this into [[TMP3:%.*]] = sext <8 x i1> [[TMP2]] to <8 x i32> - unfortunately VectorCombine still doesn't see the [[TMP3]] line, as it's still wrapped in bitcasts, which don't get cleared out until an InstCombine run after the last VectorCombine run.

Once the peek through bitcasts is working, my intention is to add a VectorCombine fold for shuffle (ext x), (ext y) -> ext (shuffle x, y) (we already do something similar for binops; a sketch follows below), and after all that I can then update #58895 to peek through extract_subvector shuffles to finally be able to convert blendv intrinsics to selects.
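
Applied to the pr67803 pattern above, the intended end state would look roughly like this (an illustrative sketch of the planned fold, not output from this patch):

; planned: shuffle (sext A), (sext B) --> sext (shuffle A, B)
%lo  = shufflevector <8 x i1> %cmp, <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%slo = sext <4 x i1> %lo to <4 x i32>
%hi  = shufflevector <8 x i1> %cmp, <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%shi = sext <4 x i1> %hi to <4 x i32>
%r   = shufflevector <4 x i32> %slo, <4 x i32> %shi, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

; step 1, shuffle-of-extends:
%w = shufflevector <4 x i1> %lo, <4 x i1> %hi, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%r = sext <8 x i1> %w to <8 x i32>

; step 2, shuffle-of-shuffles collapses %w back to %cmp:
%r = sext <8 x i1> %cmp to <8 x i32>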

@dtcxzyw (Member) left a comment

LGTM.

@RKSimon RKSimon merged commit 1d06f41 into llvm:main Apr 2, 2024
@RKSimon RKSimon deleted the bitcast-shuffle branch April 2, 2024 09:58