[LoopVectorize] Use CodeSize as the cost kind for minsize #124119

Conversation
Functions marked with minsize should aim for minimum code size, so the vectorizer should use CodeSize as the cost kind. The cost we compare should also be the cost of the entire loop: it shouldn't be divided by the number of vector elements, and block costs shouldn't be divided by the block probability. Possibly we should be doing this for optsize as well, but there are a lot of tests that assume the current behaviour, and the definition of optsize is less clear than minsize (for minsize the goal is to "keep the code size of this function as small as possible", whereas for optsize it's "keep the code size of this function low").
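To make the comparison change concrete, here is a minimal standalone sketch (not the actual LLVM types or the real isMoreProfitable implementation; the VF and cost numbers are invented for illustration):

#include <cstdio>

// Each candidate plan has a vectorization factor (VF) and a cost for one
// iteration of its loop body.
struct Candidate {
  unsigned VF;
  unsigned LoopCost;
};

// TCK_RecipThroughput: amortize the loop cost over the elements processed,
// i.e. compare A.LoopCost/A.VF against B.LoopCost/B.VF (cross-multiplied
// to stay in integer arithmetic).
bool moreProfitableThroughput(Candidate A, Candidate B) {
  return A.LoopCost * B.VF < B.LoopCost * A.VF;
}

// TCK_CodeSize: compare whole-loop costs directly; on a tie prefer the
// wider VF, on the assumption that throughput will be greater.
bool moreProfitableCodeSize(Candidate A, Candidate B) {
  return A.LoopCost < B.LoopCost ||
         (A.LoopCost == B.LoopCost && A.VF > B.VF);
}

int main() {
  Candidate Scalar{1, 3}, Vec4{4, 8};
  // Per element the VF=4 plan wins (8/4 = 2 < 3)...
  printf("%d\n", moreProfitableThroughput(Vec4, Scalar)); // prints 1
  // ...but it costs more for the whole loop, so for code size it loses.
  printf("%d\n", moreProfitableCodeSize(Vec4, Scalar));   // prints 0
}

This is why minsize functions stop dividing by the number of vector elements: under a size objective the amortized per-element cost overstates how attractive wide vectors are.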
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms Author: John Brawn (john-brawn-arm). Patch is 140.13 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/124119.diff 4 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 7167e2179af535..9a617fc66cb935 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -978,7 +978,9 @@ class LoopVectorizationCostModel {
InterleavedAccessInfo &IAI)
: ScalarEpilogueStatus(SEL), TheLoop(L), PSE(PSE), LI(LI), Legal(Legal),
TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), TheFunction(F),
- Hints(Hints), InterleaveInfo(IAI), CostKind(TTI::TCK_RecipThroughput) {}
+ Hints(Hints), InterleaveInfo(IAI) {
+ CostKind = F->hasMinSize() ? TTI::TCK_CodeSize : TTI::TCK_RecipThroughput;
+ }
/// \return An upper bound for the vectorization factors (both fixed and
/// scalable). If the factors are 0, vectorization and interleaving should be
@@ -4277,6 +4279,13 @@ bool LoopVectorizationPlanner::isMoreProfitable(
EstimatedWidthB *= *VScale;
}
+ // When optimizing for size choose whichever is smallest, which will be the
+ // one with the smallest cost for the whole loop. On a tie pick the larger
+ // vector width, on the assumption that throughput will be greater.
+ if (CM.CostKind == TTI::TCK_CodeSize)
+ return CostA < CostB ||
+ (CostA == CostB && EstimatedWidthA > EstimatedWidthB);
+
// Assume vscale may be larger than 1 (or the value being tuned for),
// so that scalable vectorization is slightly favorable over fixed-width
// vectorization.
@@ -5506,7 +5515,8 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
}
// Scale the total scalar cost by block probability.
- ScalarCost /= getReciprocalPredBlockProb();
+ if (CostKind != TTI::TCK_CodeSize)
+ ScalarCost /= getReciprocalPredBlockProb();
// Compute the discount. A non-negative discount means the vector version
// of the instruction costs more, and scalarizing would be beneficial.
@@ -5558,7 +5568,8 @@ InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
// the predicated block, if it is an if-else block. Thus, scale the block's
// cost by the probability of executing it. blockNeedsPredication from
// Legal is used so as to not include all blocks in tail folded loops.
- if (VF.isScalar() && Legal->blockNeedsPredication(BB))
+ if (VF.isScalar() && Legal->blockNeedsPredication(BB) &&
+ CostKind != TTI::TCK_CodeSize)
BlockCost /= getReciprocalPredBlockProb();
Cost += BlockCost;
@@ -5637,7 +5648,8 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
// conditional branches, but may not be executed for each vector lane. Scale
// the cost by the probability of executing the predicated block.
if (isPredicatedInst(I)) {
- Cost /= getReciprocalPredBlockProb();
+ if (CostKind != TTI::TCK_CodeSize)
+ Cost /= getReciprocalPredBlockProb();
// Add the cost of an i1 extract and a branch
auto *VecI1Ty =
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index f1228368804beb..b92bfbe716855a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -793,7 +793,7 @@ InstructionCost VPRegionBlock::cost(ElementCount VF, VPCostContext &Ctx) {
// For the scalar case, we may not always execute the original predicated
// block, Thus, scale the block's cost by the probability of executing it.
- if (VF.isScalar())
+ if (VF.isScalar() && Ctx.CostKind != TTI::TCK_CodeSize)
return ThenCost / getReciprocalPredBlockProb();
return ThenCost;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/optsize_minsize.ll b/llvm/test/Transforms/LoopVectorize/AArch64/optsize_minsize.ll
new file mode 100644
index 00000000000000..37dd63b5c093da
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/optsize_minsize.ll
@@ -0,0 +1,1067 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; The tests here check for differences in behaviour between the default,
+; optsize, and minsize.
+; RUN: opt -passes=loop-vectorize -S < %s | FileCheck %s --check-prefix=DEFAULT
+; RUN: opt -passes=forceattrs,loop-vectorize -force-attribute=optsize -S < %s | FileCheck %s --check-prefix=OPTSIZE
+; RUN: opt -passes=forceattrs,loop-vectorize -force-attribute=minsize -S < %s | FileCheck %s --check-prefix=MINSIZE
+
+target triple = "aarch64-unknown-linux-gnu"
+
+@A = global [1000 x i16] zeroinitializer, align 2
+@B = global [1000 x i32] zeroinitializer, align 4
+@C = global [1000 x i32] zeroinitializer, align 4
+
+; This should always vectorize, as using vector instructions eliminates the loop
+; which is both faster and smaller (a scalar version is emitted, but the branch
+; to it is false and it's later removed).
+define void @always_vectorize(ptr %p, i32 %x) {
+; DEFAULT-LABEL: define void @always_vectorize(
+; DEFAULT-SAME: ptr [[P:%.*]], i32 [[X:%.*]]) {
+; DEFAULT-NEXT: [[ENTRY:.*]]:
+; DEFAULT-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; DEFAULT: [[VECTOR_PH]]:
+; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
+; DEFAULT: [[VECTOR_BODY]]:
+; DEFAULT-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 0
+; DEFAULT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 0
+; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[X]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; DEFAULT-NEXT: [[TMP3:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; DEFAULT-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 0
+; DEFAULT-NEXT: store <4 x i32> [[TMP3]], ptr [[TMP5]], align 4
+; DEFAULT-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; DEFAULT: [[MIDDLE_BLOCK]]:
+; DEFAULT-NEXT: br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
+; DEFAULT: [[SCALAR_PH]]:
+; DEFAULT-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4, %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; DEFAULT-NEXT: br label %[[FOR_BODY:.*]]
+; DEFAULT: [[FOR_BODY]]:
+; DEFAULT-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; DEFAULT-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 [[INDVARS_IV]]
+; DEFAULT-NEXT: [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; DEFAULT-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP4]], [[X]]
+; DEFAULT-NEXT: store i32 [[ADD]], ptr [[ARRAYIDX]], align 4
+; DEFAULT-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; DEFAULT-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 4
+; DEFAULT-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; DEFAULT: [[FOR_COND_CLEANUP]]:
+; DEFAULT-NEXT: ret void
+;
+; OPTSIZE-LABEL: define void @always_vectorize(
+; OPTSIZE-SAME: ptr [[P:%.*]], i32 [[X:%.*]]) #[[ATTR0:[0-9]+]] {
+; OPTSIZE-NEXT: [[ENTRY:.*]]:
+; OPTSIZE-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; OPTSIZE: [[VECTOR_PH]]:
+; OPTSIZE-NEXT: br label %[[VECTOR_BODY:.*]]
+; OPTSIZE: [[VECTOR_BODY]]:
+; OPTSIZE-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 0
+; OPTSIZE-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 0
+; OPTSIZE-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
+; OPTSIZE-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[X]], i64 0
+; OPTSIZE-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; OPTSIZE-NEXT: [[TMP3:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; OPTSIZE-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 0
+; OPTSIZE-NEXT: store <4 x i32> [[TMP3]], ptr [[TMP5]], align 4
+; OPTSIZE-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; OPTSIZE: [[MIDDLE_BLOCK]]:
+; OPTSIZE-NEXT: br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
+; OPTSIZE: [[SCALAR_PH]]:
+; OPTSIZE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4, %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; OPTSIZE-NEXT: br label %[[FOR_BODY:.*]]
+; OPTSIZE: [[FOR_BODY]]:
+; OPTSIZE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; OPTSIZE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 [[INDVARS_IV]]
+; OPTSIZE-NEXT: [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; OPTSIZE-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP4]], [[X]]
+; OPTSIZE-NEXT: store i32 [[ADD]], ptr [[ARRAYIDX]], align 4
+; OPTSIZE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; OPTSIZE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 4
+; OPTSIZE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; OPTSIZE: [[FOR_COND_CLEANUP]]:
+; OPTSIZE-NEXT: ret void
+;
+; MINSIZE-LABEL: define void @always_vectorize(
+; MINSIZE-SAME: ptr [[P:%.*]], i32 [[X:%.*]]) #[[ATTR0:[0-9]+]] {
+; MINSIZE-NEXT: [[ENTRY:.*]]:
+; MINSIZE-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; MINSIZE: [[VECTOR_PH]]:
+; MINSIZE-NEXT: br label %[[VECTOR_BODY:.*]]
+; MINSIZE: [[VECTOR_BODY]]:
+; MINSIZE-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 0
+; MINSIZE-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 0
+; MINSIZE-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP1]], align 4
+; MINSIZE-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[X]], i64 0
+; MINSIZE-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; MINSIZE-NEXT: [[TMP2:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; MINSIZE-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i32 0
+; MINSIZE-NEXT: store <4 x i32> [[TMP2]], ptr [[TMP3]], align 4
+; MINSIZE-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; MINSIZE: [[MIDDLE_BLOCK]]:
+; MINSIZE-NEXT: br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
+; MINSIZE: [[SCALAR_PH]]:
+; MINSIZE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4, %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; MINSIZE-NEXT: br label %[[FOR_BODY:.*]]
+; MINSIZE: [[FOR_BODY]]:
+; MINSIZE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; MINSIZE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 [[INDVARS_IV]]
+; MINSIZE-NEXT: [[TMP4:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; MINSIZE-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP4]], [[X]]
+; MINSIZE-NEXT: store i32 [[ADD]], ptr [[ARRAYIDX]], align 4
+; MINSIZE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; MINSIZE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 4
+; MINSIZE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; MINSIZE: [[FOR_COND_CLEANUP]]:
+; MINSIZE-NEXT: ret void
+;
+entry:
+ br label %for.body
+
+for.body:
+ %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+ %arrayidx = getelementptr inbounds i32, ptr %p, i64 %indvars.iv
+ %0 = load i32, ptr %arrayidx, align 4
+ %add = add nsw i32 %0, %x
+ store i32 %add, ptr %arrayidx, align 4
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, 4
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+
+for.cond.cleanup:
+ ret void
+}
+
+; This should vectorize only without optsize, as it needs a scalar version
+; which increases code size.
+define void @vectorize_without_optsize(ptr %p, i32 %x, i64 %n) {
+; DEFAULT-LABEL: define void @vectorize_without_optsize(
+; DEFAULT-SAME: ptr [[P:%.*]], i32 [[X:%.*]], i64 [[N:%.*]]) {
+; DEFAULT-NEXT: [[ENTRY:.*]]:
+; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 8
+; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; DEFAULT: [[VECTOR_PH]]:
+; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], 8
+; DEFAULT-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[X]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
+; DEFAULT: [[VECTOR_BODY]]:
+; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
+; DEFAULT-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 [[TMP0]]
+; DEFAULT-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 0
+; DEFAULT-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 4
+; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
+; DEFAULT-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i32>, ptr [[TMP3]], align 4
+; DEFAULT-NEXT: [[TMP4:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; DEFAULT-NEXT: [[TMP5:%.*]] = add nsw <4 x i32> [[WIDE_LOAD1]], [[BROADCAST_SPLAT]]
+; DEFAULT-NEXT: store <4 x i32> [[TMP4]], ptr [[TMP2]], align 4
+; DEFAULT-NEXT: store <4 x i32> [[TMP5]], ptr [[TMP3]], align 4
+; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; DEFAULT-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; DEFAULT-NEXT: br i1 [[TMP6]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; DEFAULT: [[MIDDLE_BLOCK]]:
+; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; DEFAULT-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
+; DEFAULT: [[SCALAR_PH]]:
+; DEFAULT-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; DEFAULT-NEXT: br label %[[FOR_BODY:.*]]
+; DEFAULT: [[FOR_BODY]]:
+; DEFAULT-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; DEFAULT-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 [[INDVARS_IV]]
+; DEFAULT-NEXT: [[TMP7:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; DEFAULT-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP7]], [[X]]
+; DEFAULT-NEXT: store i32 [[ADD]], ptr [[ARRAYIDX]], align 4
+; DEFAULT-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; DEFAULT-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
+; DEFAULT-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; DEFAULT: [[FOR_COND_CLEANUP]]:
+; DEFAULT-NEXT: ret void
+;
+; OPTSIZE-LABEL: define void @vectorize_without_optsize(
+; OPTSIZE-SAME: ptr [[P:%.*]], i32 [[X:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; OPTSIZE-NEXT: [[ENTRY:.*]]:
+; OPTSIZE-NEXT: br label %[[FOR_BODY:.*]]
+; OPTSIZE: [[FOR_BODY]]:
+; OPTSIZE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; OPTSIZE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 [[INDVARS_IV]]
+; OPTSIZE-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; OPTSIZE-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP0]], [[X]]
+; OPTSIZE-NEXT: store i32 [[ADD]], ptr [[ARRAYIDX]], align 4
+; OPTSIZE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; OPTSIZE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
+; OPTSIZE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY]]
+; OPTSIZE: [[FOR_COND_CLEANUP]]:
+; OPTSIZE-NEXT: ret void
+;
+; MINSIZE-LABEL: define void @vectorize_without_optsize(
+; MINSIZE-SAME: ptr [[P:%.*]], i32 [[X:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; MINSIZE-NEXT: [[ENTRY:.*]]:
+; MINSIZE-NEXT: br label %[[FOR_BODY:.*]]
+; MINSIZE: [[FOR_BODY]]:
+; MINSIZE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; MINSIZE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 [[INDVARS_IV]]
+; MINSIZE-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
+; MINSIZE-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP0]], [[X]]
+; MINSIZE-NEXT: store i32 [[ADD]], ptr [[ARRAYIDX]], align 4
+; MINSIZE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; MINSIZE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
+; MINSIZE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY]]
+; MINSIZE: [[FOR_COND_CLEANUP]]:
+; MINSIZE-NEXT: ret void
+;
+entry:
+ br label %for.body
+
+for.body:
+ %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+ %arrayidx = getelementptr inbounds i32, ptr %p, i64 %indvars.iv
+ %0 = load i32, ptr %arrayidx, align 4
+ %add = add nsw i32 %0, %x
+ store i32 %add, ptr %arrayidx, align 4
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, %n
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+
+for.cond.cleanup:
+ ret void
+}
+
+; This should be vectorized and tail predicated without optsize, as that's
+; faster, but not with optsize, as it's much larger.
+; FIXME: Currently we avoid tail predication only with minsize
+define void @tail_predicate_without_optsize(ptr %p, i8 %a, i8 %b, i8 %c, i32 %n) {
+; DEFAULT-LABEL: define void @tail_predicate_without_optsize(
+; DEFAULT-SAME: ptr [[P:%.*]], i8 [[A:%.*]], i8 [[B:%.*]], i8 [[C:%.*]], i32 [[N:%.*]]) {
+; DEFAULT-NEXT: [[ENTRY:.*]]:
+; DEFAULT-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; DEFAULT: [[VECTOR_PH]]:
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i8> poison, i8 [[A]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i8> [[BROADCAST_SPLATINSERT]], <16 x i8> poison, <16 x i32> zeroinitializer
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <16 x i8> poison, i8 [[B]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <16 x i8> [[BROADCAST_SPLATINSERT3]], <16 x i8> poison, <16 x i32> zeroinitializer
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT5:%.*]] = insertelement <16 x i8> poison, i8 [[C]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT6:%.*]] = shufflevector <16 x i8> [[BROADCAST_SPLATINSERT5]], <16 x i8> poison, <16 x i32> zeroinitializer
+; DEFAULT-NEXT: br label %[[VECTOR_BODY:.*]]
+; DEFAULT: [[VECTOR_BODY]]:
+; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE36:.*]] ]
+; DEFAULT-NEXT: [[VEC_IND:%.*]] = phi <16 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[PRED_STORE_CONTINUE36]] ]
+; DEFAULT-NEXT: [[VEC_IND1:%.*]] = phi <16 x i8> [ <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT2:%.*]], %[[PRED_STORE_CONTINUE36]] ]
+; DEFAULT-NEXT: [[TMP0:%.*]] = icmp ule <16 x i64> [[VEC_IND]], splat (i64 14)
+; DEFAULT-NEXT: [[TMP1:%.*]] = mul <16 x i8> [[BROADCAST_SPLAT]], [[VEC_IND1]]
+; DEFAULT-NEXT: [[TMP2:%.*]] = lshr <16 x i8> [[VEC_IND1]], splat (i8 1)
+; DEFAULT-NEXT: [[TMP3:%.*]] = mul <16 x i8> [[TMP2]], [[BROADCAST_SPLAT4]]
+; DEFAULT-NEXT: [[TMP4:%.*]] = add <16 x i8> [[TMP3]], [[TMP1]]
+; DEFAULT-NEXT: [[TMP5:%.*]] = lshr <16 x i8> [[VEC_IND1]], splat (i8 2)
+; DEFAULT-NEXT: [[TMP6:%.*]] = mul <16 x i8> [[TMP5]], [[BROADCAST_SPLAT6]]
+; DEFAULT-NEXT: [[TMP7:%.*]] = add <16 x i8> [[TMP4]], [[TMP6]]
+; DEFAULT-NEXT: [[TM...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
I don't see anything aarch64 or arm-specific about these changes. Can the tests be moved to the parent folder?
;.
; DEFAULT: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; DEFAULT: [[META1]] = !{!"llvm.loop.unroll.runtime.disable"}
; DEFAULT: [[META2]] = !{!"llvm.loop.isvectorized", i32 1}
; DEFAULT: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
; DEFAULT: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
; DEFAULT: [[LOOP5]] = distinct !{[[LOOP5]], [[META2]], [[META1]]}
; DEFAULT: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]], [[META2]]}
; DEFAULT: [[LOOP7]] = distinct !{[[LOOP7]], [[META2]], [[META1]]}
; DEFAULT: [[LOOP8]] = distinct !{[[LOOP8]], [[META1]], [[META2]]}
; DEFAULT: [[LOOP9]] = distinct !{[[LOOP9]], [[META2]], [[META1]]}
; DEFAULT: [[LOOP10]] = distinct !{[[LOOP10]], [[META1]], [[META2]]}
;.
; OPTSIZE: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; OPTSIZE: [[META1]] = !{!"llvm.loop.unroll.runtime.disable"}
; OPTSIZE: [[META2]] = !{!"llvm.loop.isvectorized", i32 1}
; OPTSIZE: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
; OPTSIZE: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
; OPTSIZE: [[LOOP5]] = distinct !{[[LOOP5]], [[META2]], [[META1]]}
; OPTSIZE: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]], [[META2]]}
; OPTSIZE: [[LOOP7]] = distinct !{[[LOOP7]], [[META2]], [[META1]]}
; OPTSIZE: [[LOOP8]] = distinct !{[[LOOP8]], [[META1]], [[META2]]}
;.
; MINSIZE: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; MINSIZE: [[META1]] = !{!"llvm.loop.unroll.runtime.disable"}
; MINSIZE: [[META2]] = !{!"llvm.loop.isvectorized", i32 1}
; MINSIZE: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
; MINSIZE: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
; MINSIZE: [[LOOP5]] = distinct !{[[LOOP5]], [[META2]], [[META1]]}
; MINSIZE: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]], [[META2]]}
I don't think we need these. Likewise for the other file.
These are all generated by update_test_checks.py. I could remove them, but the next time the test is updated they would just be added back.
I've always been asked to remove them, so I just delete those lines from the file after updating.
We generally ask that existing metadata lines are removed (unless they're a deliberate part of the test); this is checking the output of opt instead, which adds new metadata.
Removing the output line checks can be done by changing the version flag used by update_test_checks.py at the top of the file -- instead of UTC_ARGS: --version 5, try version 4 or maybe 3. There might be another flag you can use as well, but I can't see it from the help text.
It looks like --check-globals none is the flag to do this.
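For example (an illustrative edit, assuming the flag is accepted in UTC_ARGS as written -- it is not part of this patch), the autogeneration note at the top of the test file would become:

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5 --check-globals none

so that subsequent reruns of the script keep the global metadata checks suppressed.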
Costs are calculated by TargetTransformInfo and so are target-dependent. The two test files are also expecting different things, e.g. in dont_vectorize_with_minsize we expect different vectorization factors for arm and aarch64.
Looks good to me now, thank you.
Does this fix #118100?
Do you have any statistics/perf number on the impact of this?
@@ -5506,7 +5515,8 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
   }

   // Scale the total scalar cost by block probability.
-  ScalarCost /= getReciprocalPredBlockProb();
+  if (CostKind != TTI::TCK_CodeSize)
It looks like the checks here and below may not be covered by the existing tests. Would be good to add test coverage for some of them, if possible.
I haven't been able to come up with a test that specifically exercises this: in any test I write, the cost is large enough that incorrectly halving it doesn't matter, because it's still larger than the scalar cost. If getReciprocalPredBlockProb used the actual block probability, instead of assuming it's 0.5 as it currently does, then I could probably construct one by setting the block probability close to zero.
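For reference, a minimal sketch of the scaling being discussed (simplified: InstructionCost is replaced by a plain integer, and this is not the real LoopVectorize.cpp code path):

#include <cassert>

// The cost model assumes a predicated block executes with probability 1/2,
// so it returns the reciprocal, 2.
static unsigned getReciprocalPredBlockProb() { return 2; }

enum CostKind { RecipThroughput, CodeSize };

// For throughput, the cost of a predicated block is halved because it only
// runs on some iterations. For code size the instructions are emitted
// whether or not the block ever executes, so no scaling is applied.
unsigned scaledPredBlockCost(unsigned BlockCost, CostKind Kind) {
  if (Kind != CodeSize)
    BlockCost /= getReciprocalPredBlockProb();
  return BlockCost;
}

int main() {
  assert(scaledPredBlockCost(8, RecipThroughput) == 4);
  assert(scaledPredBlockCost(8, CodeSize) == 8);
  return 0;
}

With a fixed factor of 2, the halving can only flip a vectorization decision when the scalar and vector costs are within a factor of two of each other, which is why a targeted test is hard to construct.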
It doesn't fix #118100, as that uses -Os, which adds optsize, not minsize. Compiling the test case from that issue with -Oz -fvectorize, it looks like we currently don't vectorize anyway, for other reasons.
Compiling the llvm-test-suite SingleSource and MultiSource benchmarks with
Ping.
Ping. @fhahn, have I sufficiently addressed your comments?
LGTM, thanks!
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/13911

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/190/builds/15358

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/14111
PR #124119 wasn't rebased & tested before merging. Update the failing tests.
Looks like this was causing some test failures due to it not being rebased onto the latest main before merging, which missed some test updates. Should be fixed in 649f4dc.
Thanks for fixing that.
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/12068

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/153/builds/24095

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/60/builds/20614