[VectorCombine] foldBitcastShuffle - peek through any residual bitcasts before creating a new bitcast on top #86119


Merged - RKSimon merged 1 commit into llvm:main from RKSimon:bitcast-shuffle on Apr 2, 2024

Conversation

@RKSimon (Collaborator) commented Mar 21, 2024

Encountered while working on #67803, wading through the chains of bitcasts that SSE intrinsics introduce - this patch helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?
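
For reference, a minimal IR sketch of the before/after, simplified from the PR35454_1 test in the diff below:

; input: a bitcast of a shuffle whose operand is itself a bitcast
%bc   = bitcast <2 x i64> %v to <4 x i32>
%shuf = shufflevector <4 x i32> %bc, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
%bc1  = bitcast <4 x i32> %shuf to <16 x i8>

; foldBitcastShuffle previously rebuilt the cast on top of %bc, leaving a chain:
%bc   = bitcast <2 x i64> %v to <4 x i32>
%tmp  = bitcast <4 x i32> %bc to <16 x i8>
%bc1  = shufflevector <16 x i8> %tmp, <16 x i8> poison, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>

; with the peek, the new cast starts from the original source:
%tmp  = bitcast <2 x i64> %v to <16 x i8>
%bc1  = shufflevector <16 x i8> %tmp, <16 x i8> poison, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>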

@llvmbot (Member) commented Mar 21, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Simon Pilgrim (RKSimon)


Full diff: https://github.com/llvm/llvm-project/pull/86119.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VectorCombine.cpp (+14-2)
  • (modified) llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll (+2-4)
  • (modified) llvm/test/Transforms/VectorCombine/X86/shuffle.ll (+2-4)
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index 7e86137f23f3c8..87e05dde545cd5 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -135,6 +135,18 @@ class VectorCombine {
 };
 } // namespace
 
+/// Return the source operand of a potentially bitcasted value while
+/// optionally checking if it has one use. If there is no bitcast or the one
+/// use check is not met, return the input value itself.
+static Value *peekThroughBitcast(Value *V, bool OneUseOnly = false) {
+  if (auto *BitCast = dyn_cast<BitCastInst>(V))
+    if (!OneUseOnly || BitCast->hasOneUse())
+      return BitCast->getOperand(0);
+
+  // V is not a bitcast or V has more than one use and OneUseOnly is true.
+  return V;
+}
+
 static bool canWidenLoad(LoadInst *Load, const TargetTransformInfo &TTI) {
   // Do not widen load if atomic/volatile or under asan/hwasan/memtag/tsan.
   // The widened load may load data from dirty regions or create data races
@@ -751,8 +763,8 @@ bool VectorCombine::foldBitcastShuffle(Instruction &I) {
 
   // bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'
   ++NumShufOfBitcast;
-  Value *CastV0 = Builder.CreateBitCast(V0, NewShuffleTy);
-  Value *CastV1 = Builder.CreateBitCast(V1, NewShuffleTy);
+  Value *CastV0 = Builder.CreateBitCast(peekThroughBitcast(V0), NewShuffleTy);
+  Value *CastV1 = Builder.CreateBitCast(peekThroughBitcast(V1), NewShuffleTy);
   Value *Shuf = Builder.CreateShuffleVector(CastV0, CastV1, NewMask);
   replaceValue(I, *Shuf);
   return true;
diff --git a/llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll b/llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll
index 8c5c6656ca1795..74a58c8d313611 100644
--- a/llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/shuffle-inseltpoison.ll
@@ -133,8 +133,7 @@ define <2 x i64> @PR35454_1(<2 x i64> %v) {
 ; SSE-NEXT:    ret <2 x i64> [[BC3]]
 ;
 ; AVX-LABEL: @PR35454_1(
-; AVX-NEXT:    [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; AVX-NEXT:    [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <16 x i8>
+; AVX-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[V:%.*]] to <16 x i8>
 ; AVX-NEXT:    [[BC1:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>
 ; AVX-NEXT:    [[ADD:%.*]] = shl <16 x i8> [[BC1]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
 ; AVX-NEXT:    [[BC2:%.*]] = bitcast <16 x i8> [[ADD]] to <4 x i32>
@@ -164,8 +163,7 @@ define <2 x i64> @PR35454_2(<2 x i64> %v) {
 ; SSE-NEXT:    ret <2 x i64> [[BC3]]
 ;
 ; AVX-LABEL: @PR35454_2(
-; AVX-NEXT:    [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; AVX-NEXT:    [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <8 x i16>
+; AVX-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[V:%.*]] to <8 x i16>
 ; AVX-NEXT:    [[BC1:%.*]] = shufflevector <8 x i16> [[TMP1]], <8 x i16> poison, <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1>
 ; AVX-NEXT:    [[ADD:%.*]] = shl <8 x i16> [[BC1]], <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
 ; AVX-NEXT:    [[BC2:%.*]] = bitcast <8 x i16> [[ADD]] to <4 x i32>
diff --git a/llvm/test/Transforms/VectorCombine/X86/shuffle.ll b/llvm/test/Transforms/VectorCombine/X86/shuffle.ll
index 60cfc4d4b07079..d1484fd5ab3399 100644
--- a/llvm/test/Transforms/VectorCombine/X86/shuffle.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/shuffle.ll
@@ -133,8 +133,7 @@ define <2 x i64> @PR35454_1(<2 x i64> %v) {
 ; SSE-NEXT:    ret <2 x i64> [[BC3]]
 ;
 ; AVX-LABEL: @PR35454_1(
-; AVX-NEXT:    [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; AVX-NEXT:    [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <16 x i8>
+; AVX-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[V:%.*]] to <16 x i8>
 ; AVX-NEXT:    [[BC1:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> <i32 12, i32 13, i32 14, i32 15, i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>
 ; AVX-NEXT:    [[ADD:%.*]] = shl <16 x i8> [[BC1]], <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
 ; AVX-NEXT:    [[BC2:%.*]] = bitcast <16 x i8> [[ADD]] to <4 x i32>
@@ -164,8 +163,7 @@ define <2 x i64> @PR35454_2(<2 x i64> %v) {
 ; SSE-NEXT:    ret <2 x i64> [[BC3]]
 ;
 ; AVX-LABEL: @PR35454_2(
-; AVX-NEXT:    [[BC:%.*]] = bitcast <2 x i64> [[V:%.*]] to <4 x i32>
-; AVX-NEXT:    [[TMP1:%.*]] = bitcast <4 x i32> [[BC]] to <8 x i16>
+; AVX-NEXT:    [[TMP1:%.*]] = bitcast <2 x i64> [[V:%.*]] to <8 x i16>
 ; AVX-NEXT:    [[BC1:%.*]] = shufflevector <8 x i16> [[TMP1]], <8 x i16> poison, <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1>
 ; AVX-NEXT:    [[ADD:%.*]] = shl <8 x i16> [[BC1]], <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
 ; AVX-NEXT:    [[BC2:%.*]] = bitcast <8 x i16> [[ADD]] to <4 x i32>

@dtcxzyw (Member) commented Mar 21, 2024

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?

FYI, folding bitcast of bitcast is always safe. For reference, this is the cast-pair elimination table from CastInst::isEliminableCastPair in llvm/lib/IR/Instructions.cpp:

  // Define the 144 possibilities for these two cast instructions. The values
  // in this matrix determine what to do in a given situation and select the
  // case in the switch below.  The rows correspond to firstOp, the columns
  // correspond to secondOp.  In looking at the table below, keep in mind
  // the following cast properties:
  //
  //           Size Compare       Source               Destination
  // Operator  Src ? Size   Type       Sign         Type       Sign
  // -------- ------------ -------------------   ---------------------
  // TRUNC         >       Integer      Any        Integral     Any
  // ZEXT          <       Integral   Unsigned     Integer      Any
  // SEXT          <       Integral    Signed      Integer      Any
  // FPTOUI       n/a      FloatPt      n/a        Integral   Unsigned
  // FPTOSI       n/a      FloatPt      n/a        Integral    Signed
  // UITOFP       n/a      Integral   Unsigned     FloatPt      n/a
  // SITOFP       n/a      Integral    Signed      FloatPt      n/a
  // FPTRUNC       >       FloatPt      n/a        FloatPt      n/a
  // FPEXT         <       FloatPt      n/a        FloatPt      n/a
  // PTRTOINT     n/a      Pointer      n/a        Integral   Unsigned
  // INTTOPTR     n/a      Integral   Unsigned     Pointer      n/a
  // BITCAST       =       FirstClass   n/a        FirstClass   n/a
  // ADDRSPCST    n/a      Pointer      n/a        Pointer      n/a
  //
  // NOTE: some transforms are safe, but we consider them to be non-profitable.
  // For example, we could merge "fptoui double to i32" + "zext i32 to i64",
  // into "fptoui double to i64", but this loses information about the range
  // of the produced value (we no longer know the top-part is all zeros).
  // Further this conversion is often much more expensive for typical hardware,
  // and causes issues when building libgcc.  We disallow fptosi+sext for the
  // same reason.
  const unsigned numCastOps =
      Instruction::CastOpsEnd - Instruction::CastOpsBegin;
  static const uint8_t CastResults[numCastOps][numCastOps] = {
      // T        F  F  U  S  F  F  P  I  B  A  -+
      // R  Z  S  P  P  I  I  T  P  2  N  T  S   |
      // U  E  E  2  2  2  2  R  E  I  T  C  C   +- secondOp
      // N  X  X  U  S  F  F  N  X  N  2  V  V   |
      // C  T  T  I  I  P  P  C  T  T  P  T  T  -+
      {  1, 0, 0,99,99, 0, 0,99,99,99, 0, 3, 0}, // Trunc         -+
      {  8, 1, 9,99,99, 2,17,99,99,99, 2, 3, 0}, // ZExt           |
      {  8, 0, 1,99,99, 0, 2,99,99,99, 0, 3, 0}, // SExt           |
      {  0, 0, 0,99,99, 0, 0,99,99,99, 0, 3, 0}, // FPToUI         |
      {  0, 0, 0,99,99, 0, 0,99,99,99, 0, 3, 0}, // FPToSI         |
      { 99,99,99, 0, 0,99,99, 0, 0,99,99, 4, 0}, // UIToFP        +- firstOp
      { 99,99,99, 0, 0,99,99, 0, 0,99,99, 4, 0}, // SIToFP         |
      { 99,99,99, 0, 0,99,99, 0, 0,99,99, 4, 0}, // FPTrunc        |
      { 99,99,99, 2, 2,99,99, 8, 2,99,99, 4, 0}, // FPExt          |
      {  1, 0, 0,99,99, 0, 0,99,99,99, 7, 3, 0}, // PtrToInt       |
      { 99,99,99,99,99,99,99,99,99,11,99,15, 0}, // IntToPtr       |
      {  5, 5, 5, 0, 0, 5, 5, 0, 0,16, 5, 1,14}, // BitCast        |
      {  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,13,12}, // AddrSpaceCast -+
  };

  // TODO: This logic could be encoded into the table above and handled in the
  // switch below.
  // If either of the casts are a bitcast from scalar to vector, disallow the
  // merging. However, any pair of bitcasts are allowed.
  bool IsFirstBitcast  = (firstOp == Instruction::BitCast);
  bool IsSecondBitcast = (secondOp == Instruction::BitCast);
  bool AreBothBitcasts = IsFirstBitcast && IsSecondBitcast;

  // Check if any of the casts convert scalars <-> vectors.
  if ((IsFirstBitcast  && isa<VectorType>(SrcTy) != isa<VectorType>(MidTy)) ||
      (IsSecondBitcast && isa<VectorType>(MidTy) != isa<VectorType>(DstTy)))
    if (!AreBothBitcasts)
      return 0;
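
Reading the table: the BitCast row / BitCast column entry is 1 (in the switch that consumes this table, case 1 means the pair is replaced by a single cast with the first cast's opcode), and the scalar <-> vector guard above is explicitly skipped when both casts are bitcasts. So a pair of bitcasts always collapses; an illustrative IR sketch (not from the patch's tests):

; bitcast of bitcast folds unconditionally, even across scalar <-> vector:
%mid = bitcast <4 x i32> %v to <2 x i64>
%out = bitcast <2 x i64> %mid to i128
; ==> folds to:
%out = bitcast <4 x i32> %v to i128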

[VectorCombine] foldBitcastShuffle - peek through any residual bitcasts before creating a new bitcast on top

Encountered while working on llvm#67803, this helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?
@RKSimon (Collaborator, Author) commented Mar 22, 2024

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?

FYI, folding bitcast of bitcast is always safe.

It was mainly potential OneUse restrictions that concerned me. VectorCombine won't mind, but I can imagine there are cases where it would be a problem.
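
One concrete reading of that concern, as a hypothetical IR sketch (not from the patch's tests): if the inner bitcast has other users, peeking through it inside CreateBitCast cannot actually delete the chain, and a caller relying on use counts might not want the implicit peek:

%mid = bitcast <2 x i64> %v to <4 x i32>
%use = add <4 x i32> %mid, %mid               ; %mid has another user
%new = bitcast <4 x i32> %mid to <16 x i8>    ; peeking would give: bitcast <2 x i64> %v to <16 x i8>
; either way both bitcasts stay live, so nothing is removed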

@RKSimon (Collaborator, Author) commented Mar 25, 2024

Any further comments?

@nikic (Contributor) commented Mar 26, 2024

Encountered while working on #67803, wading through the chains of bitcasts that SSE intrinsics introduce - this patch helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.

Is that for further transforms in VectorCombine or somewhere else? Can we have a test that shows doing this optimization in VectorCombine has a benefit?

@RKSimon (Collaborator, Author) commented Mar 26, 2024

It's mainly a phase-ordering problem for #67803 (see https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/PhaseOrdering/X86/pr67803.ll):

; CHECK-NEXT:    [[TMP2:%.*]] = icmp sgt <8 x i32> [[TMP0]], [[TMP1]]
; CHECK-NEXT:    [[CMP_I21:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT:    [[SEXT_I22:%.*]] = sext <4 x i1> [[CMP_I21]] to <4 x i32>
; CHECK-NEXT:    [[CMP_I:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
; CHECK-NEXT:    [[SEXT_I:%.*]] = sext <4 x i1> [[CMP_I]] to <4 x i32>
; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <4 x i32> [[SEXT_I22]], <4 x i32> [[SEXT_I]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

The aim is to merge all this into [[TMP3:%.*]] = sext <8 x i1> [[TMP2]] to <8 x i32> - unfortunately VectorCombine still doesn't see the [[TMP3]] line, as it's still wrapped in bitcasts, which don't get cleared out until an InstCombine run after the last VectorCombine run.

Once the peek through bitcasts is working, my intention is to add a VectorCombine fold for shuffle (ext x), (ext y) -> ext (shuffle x, y) (we already do something similar for binops; a sketch follows below), and after all that I can then update #58895 to peek through extract_subvector shuffles to finally be able to convert blendv intrinsics to selects.
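
Applied to the pr67803 pattern above, the intended end state would look roughly like this (an illustrative sketch of the planned fold, not output from this patch):

; planned: shuffle (sext A), (sext B) --> sext (shuffle A, B)
%lo  = shufflevector <8 x i1> %cmp, <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%slo = sext <4 x i1> %lo to <4 x i32>
%hi  = shufflevector <8 x i1> %cmp, <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%shi = sext <4 x i1> %hi to <4 x i32>
%r   = shufflevector <4 x i32> %slo, <4 x i32> %shi, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

; step 1, shuffle-of-extends:
%w = shufflevector <4 x i1> %lo, <4 x i1> %hi, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%r = sext <8 x i1> %w to <8 x i32>

; step 2, shuffle-of-shuffles collapses %w back to %cmp:
%r = sext <8 x i1> %cmp to <8 x i32>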

@dtcxzyw (Member) left a comment

LGTM.

@RKSimon RKSimon merged commit 1d06f41 into llvm:main Apr 2, 2024
@RKSimon RKSimon deleted the bitcast-shuffle branch April 2, 2024 09:58