[LAA] Be more careful when evaluating AddRecs at symbolic max BTC. #128061
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-llvm-analysis

Author: Florian Hahn (fhahn)

Changes

Evaluating AR at the symbolic max BTC may wrap and create an expression that is less than the start of the AddRec due to wrapping (for example, consider MaxBTC = -2). If that's the case, set ScEnd to -(EltSize + 1). ScEnd will get incremented by EltSize before returning, so this effectively sets ScEnd to unsigned max. Note that LAA separately checks that accesses cannot wrap (52ded67, #127543), so unsigned max represents an upper bound.

When there is a computable backedge-taken count, we are guaranteed to execute that number of iterations, and if any pointer would wrap it would be UB (or the access will never be executed, so cannot alias).

This PR includes new tests from the previous discussion that show a case where we wrap with a BTC, but it is UB due to the pointer after the object wrapping (in

Note that an earlier version of the patch was shared as #106530, but I accidentally deleted the branch and now I cannot figure out how to reopen that PR.

Full diff: https://github.com/llvm/llvm-project/pull/128061.diff

5 Files Affected:
diff --git a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
index cb6f47e3a76be..91802cc4361ae 100644
--- a/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
+++ b/llvm/include/llvm/Analysis/LoopAccessAnalysis.h
@@ -872,7 +872,7 @@ bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
/// NoConflict = (P2.Start >= P1.End) || (P1.Start >= P2.End)
std::pair<const SCEV *, const SCEV *> getStartAndEndForAccess(
const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy, const SCEV *MaxBECount,
- ScalarEvolution *SE,
+ const SCEV *SymbolicMaxBECount, ScalarEvolution *SE,
DenseMap<std::pair<const SCEV *, Type *>,
std::pair<const SCEV *, const SCEV *>> *PointerBounds);
diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index b461c41d29e84..5a8eedfa261d2 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -319,11 +319,14 @@ bool llvm::isDereferenceableAndAlignedInLoop(
const SCEV *MaxBECount =
Predicates ? SE.getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
: SE.getConstantMaxBackedgeTakenCount(L);
+ const SCEV *SymbolicMaxBECount =
+ Predicates ? SE.getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
+ : SE.getConstantMaxBackedgeTakenCount(L);
if (isa<SCEVCouldNotCompute>(MaxBECount))
return false;
const auto &[AccessStart, AccessEnd] = getStartAndEndForAccess(
- L, PtrScev, LI->getType(), MaxBECount, &SE, nullptr);
+ L, PtrScev, LI->getType(), MaxBECount, SymbolicMaxBECount, &SE, nullptr);
if (isa<SCEVCouldNotCompute>(AccessStart) ||
isa<SCEVCouldNotCompute>(AccessEnd))
return false;
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index a1d91de3bb788..cdce1f1941c2f 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -190,7 +190,7 @@ RuntimeCheckingPtrGroup::RuntimeCheckingPtrGroup(
std::pair<const SCEV *, const SCEV *> llvm::getStartAndEndForAccess(
const Loop *Lp, const SCEV *PtrExpr, Type *AccessTy, const SCEV *MaxBECount,
- ScalarEvolution *SE,
+ const SCEV *SymbolicMaxBECount, ScalarEvolution *SE,
DenseMap<std::pair<const SCEV *, Type *>,
std::pair<const SCEV *, const SCEV *>> *PointerBounds) {
std::pair<const SCEV *, const SCEV *> *PtrBoundsPair;
@@ -206,11 +206,31 @@ std::pair<const SCEV *, const SCEV *> llvm::getStartAndEndForAccess(
const SCEV *ScStart;
const SCEV *ScEnd;
+ auto &DL = Lp->getHeader()->getDataLayout();
+ Type *IdxTy = DL.getIndexType(PtrExpr->getType());
+ const SCEV *EltSizeSCEV = SE->getStoreSizeOfExpr(IdxTy, AccessTy);
if (SE->isLoopInvariant(PtrExpr, Lp)) {
ScStart = ScEnd = PtrExpr;
} else if (auto *AR = dyn_cast<SCEVAddRecExpr>(PtrExpr)) {
ScStart = AR->getStart();
- ScEnd = AR->evaluateAtIteration(MaxBECount, *SE);
+ if (!isa<SCEVCouldNotCompute>(MaxBECount))
+ // Evaluating AR at an exact BTC is safe: LAA separately checks that
+ // accesses cannot wrap in the loop. If evaluating AR at BTC wraps, then
+ // the loop either triggers UB when executing a memory access with a
+ // poison pointer or the wrapping/poisoned pointer is not used.
+ ScEnd = AR->evaluateAtIteration(MaxBECount, *SE);
+ else {
+ // Evaluating AR at MaxBTC may wrap and create an expression that is less
+ // than the start of the AddRec due to wrapping (for example consider
+ // MaxBTC = -2). If that's the case, set ScEnd to -(EltSize + 1). ScEnd
+ // will get incremented by EltSize before returning, so this effectively
+ // sets ScEnd to unsigned max. Note that LAA separately checks that
+ // accesses cannot not wrap, so unsigned max represents an upper bound.
+ ScEnd = AR->evaluateAtIteration(SymbolicMaxBECount, *SE);
+ if (!SE->isKnownNonNegative(SE->getMinusSCEV(ScEnd, ScStart)))
+ ScEnd = SE->getNegativeSCEV(
+ SE->getAddExpr(EltSizeSCEV, SE->getOne(EltSizeSCEV->getType())));
+ }
const SCEV *Step = AR->getStepRecurrence(*SE);
// For expressions with negative step, the upper bound is ScStart and the
@@ -232,9 +252,6 @@ std::pair<const SCEV *, const SCEV *> llvm::getStartAndEndForAccess(
assert(SE->isLoopInvariant(ScEnd, Lp) && "ScEnd needs to be invariant");
// Add the size of the pointed element to ScEnd.
- auto &DL = Lp->getHeader()->getDataLayout();
- Type *IdxTy = DL.getIndexType(PtrExpr->getType());
- const SCEV *EltSizeSCEV = SE->getStoreSizeOfExpr(IdxTy, AccessTy);
ScEnd = SE->getAddExpr(ScEnd, EltSizeSCEV);
std::pair<const SCEV *, const SCEV *> Res = {ScStart, ScEnd};
@@ -250,9 +267,11 @@ void RuntimePointerChecking::insert(Loop *Lp, Value *Ptr, const SCEV *PtrExpr,
unsigned DepSetId, unsigned ASId,
PredicatedScalarEvolution &PSE,
bool NeedsFreeze) {
- const SCEV *MaxBECount = PSE.getSymbolicMaxBackedgeTakenCount();
+ const SCEV *SymbolicMaxBECount = PSE.getSymbolicMaxBackedgeTakenCount();
+ const SCEV *MaxBECount = PSE.getBackedgeTakenCount();
const auto &[ScStart, ScEnd] = getStartAndEndForAccess(
- Lp, PtrExpr, AccessTy, MaxBECount, PSE.getSE(), &DC.getPointerBounds());
+ Lp, PtrExpr, AccessTy, MaxBECount, SymbolicMaxBECount, PSE.getSE(),
+ &DC.getPointerBounds());
assert(!isa<SCEVCouldNotCompute>(ScStart) &&
!isa<SCEVCouldNotCompute>(ScEnd) &&
"must be able to compute both start and end expressions");
@@ -1933,11 +1952,14 @@ MemoryDepChecker::getDependenceDistanceStrideAndSize(
// required for correctness.
if (SE.isLoopInvariant(Src, InnermostLoop) ||
SE.isLoopInvariant(Sink, InnermostLoop)) {
- const SCEV *MaxBECount = PSE.getSymbolicMaxBackedgeTakenCount();
+ const SCEV *MaxBECount = PSE.getBackedgeTakenCount();
+ const SCEV *SymbolicMaxBECount = PSE.getSymbolicMaxBackedgeTakenCount();
const auto &[SrcStart_, SrcEnd_] = getStartAndEndForAccess(
- InnermostLoop, Src, ATy, MaxBECount, PSE.getSE(), &PointerBounds);
+ InnermostLoop, Src, ATy, MaxBECount, SymbolicMaxBECount, PSE.getSE(),
+ &PointerBounds);
const auto &[SinkStart_, SinkEnd_] = getStartAndEndForAccess(
- InnermostLoop, Sink, BTy, MaxBECount, PSE.getSE(), &PointerBounds);
+ InnermostLoop, Sink, BTy, MaxBECount, SymbolicMaxBECount, PSE.getSE(),
+ &PointerBounds);
if (!isa<SCEVCouldNotCompute>(SrcStart_) &&
!isa<SCEVCouldNotCompute>(SrcEnd_) &&
!isa<SCEVCouldNotCompute>(SinkStart_) &&
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/evaluate-at-backedge-taken-count-wrapping.ll b/llvm/test/Analysis/LoopAccessAnalysis/evaluate-at-backedge-taken-count-wrapping.ll
new file mode 100644
index 0000000000000..d58dd38d9fef8
--- /dev/null
+++ b/llvm/test/Analysis/LoopAccessAnalysis/evaluate-at-backedge-taken-count-wrapping.ll
@@ -0,0 +1,92 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes='print<access-info>' -disable-output %s 2>&1 | FileCheck %s
+
+target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
+
+; Note: The datalayout for the test specifies a 32 bit index type.
+
+; No UB: accessing last valid byte, pointer after the object
+; doesnt wrap (%p + 2147483647).
+define void @pointer_after_object_does_not_wrap(i32 %y, ptr %s, ptr %p) {
+; CHECK-LABEL: 'pointer_after_object_does_not_wrap'
+; CHECK-NEXT: loop:
+; CHECK-NEXT: Memory dependences are safe with run-time checks
+; CHECK-NEXT: Dependences:
+; CHECK-NEXT: Run-time memory checks:
+; CHECK-NEXT: Check 0:
+; CHECK-NEXT: Comparing group ([[GRP1:0x[0-9a-f]+]]):
+; CHECK-NEXT: %gep2.iv = getelementptr inbounds i8, ptr %p, i32 %iv
+; CHECK-NEXT: Against group ([[GRP2:0x[0-9a-f]+]]):
+; CHECK-NEXT: %gep1.iv = getelementptr inbounds i8, ptr %s, i32 %iv
+; CHECK-NEXT: Grouped accesses:
+; CHECK-NEXT: Group [[GRP1]]:
+; CHECK-NEXT: (Low: (%y + %p) High: (2147483647 + %p))
+; CHECK-NEXT: Member: {(%y + %p),+,1}<nw><%loop>
+; CHECK-NEXT: Group [[GRP2]]:
+; CHECK-NEXT: (Low: (%y + %s) High: (2147483647 + %s))
+; CHECK-NEXT: Member: {(%y + %s),+,1}<nw><%loop>
+; CHECK-EMPTY:
+; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
+; CHECK-NEXT: SCEV assumptions:
+; CHECK-EMPTY:
+; CHECK-NEXT: Expressions re-written:
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i32 [ %y, %entry ], [ %iv.next, %loop ]
+ %gep1.iv = getelementptr inbounds i8 , ptr %s, i32 %iv
+ %load = load i8, ptr %gep1.iv, align 4
+ %gep2.iv = getelementptr inbounds i8, ptr %p, i32 %iv
+ store i8 %load, ptr %gep2.iv, align 4
+ %iv.next = add nsw i32 %iv, 1
+ %c.2 = icmp slt i32 %iv.next, 2147483647
+ br i1 %c.2, label %loop, label %exit
+
+exit:
+ ret void
+}
+
+; UB: accessing %p + 2147483646 and p + 2147483647.
+; Pointer the past the object would wrap in signed.
+define void @pointer_after_object_would_wrap(i32 %y, ptr %s, ptr %p) {
+; CHECK-LABEL: 'pointer_after_object_would_wrap'
+; CHECK-NEXT: loop:
+; CHECK-NEXT: Memory dependences are safe with run-time checks
+; CHECK-NEXT: Dependences:
+; CHECK-NEXT: Run-time memory checks:
+; CHECK-NEXT: Check 0:
+; CHECK-NEXT: Comparing group ([[GRP3:0x[0-9a-f]+]]):
+; CHECK-NEXT: %gep2.iv = getelementptr inbounds i8, ptr %p, i32 %iv
+; CHECK-NEXT: Against group ([[GRP4:0x[0-9a-f]+]]):
+; CHECK-NEXT: %gep1.iv = getelementptr inbounds i8, ptr %s, i32 %iv
+; CHECK-NEXT: Grouped accesses:
+; CHECK-NEXT: Group [[GRP3]]:
+; CHECK-NEXT: (Low: (%y + %p) High: (-2147483648 + %p))
+; CHECK-NEXT: Member: {(%y + %p),+,1}<nw><%loop>
+; CHECK-NEXT: Group [[GRP4]]:
+; CHECK-NEXT: (Low: (%y + %s) High: (-2147483648 + %s))
+; CHECK-NEXT: Member: {(%y + %s),+,1}<nw><%loop>
+; CHECK-EMPTY:
+; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
+; CHECK-NEXT: SCEV assumptions:
+; CHECK-EMPTY:
+; CHECK-NEXT: Expressions re-written:
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i32 [ %y, %entry ], [ %iv.next, %loop ]
+ %gep1.iv = getelementptr inbounds i8 , ptr %s, i32 %iv
+ %load = load i16, ptr %gep1.iv, align 4
+ %gep2.iv = getelementptr inbounds i8, ptr %p, i32 %iv
+ store i16 %load, ptr %gep2.iv, align 4
+ %iv.next = add nsw i32 %iv, 1
+ %c.2 = icmp slt i32 %iv.next, 2147483647
+ br i1 %c.2, label %loop, label %exit
+
+exit:
+ ret void
+}
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/evaluate-at-symbolic-max-backedge-taken-count-may-wrap.ll b/llvm/test/Analysis/LoopAccessAnalysis/evaluate-at-symbolic-max-backedge-taken-count-may-wrap.ll
index dd06cab26d095..0aa74c7b6442b 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/evaluate-at-symbolic-max-backedge-taken-count-may-wrap.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/evaluate-at-symbolic-max-backedge-taken-count-may-wrap.ll
@@ -3,7 +3,6 @@
target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
-; FIXME: Start == End for access group with AddRec.
define void @runtime_checks_with_symbolic_max_btc_neg_1(ptr %P, ptr %S, i32 %x, i32 %y) {
; CHECK-LABEL: 'runtime_checks_with_symbolic_max_btc_neg_1'
; CHECK-NEXT: loop:
@@ -17,7 +16,7 @@ define void @runtime_checks_with_symbolic_max_btc_neg_1(ptr %P, ptr %S, i32 %x,
; CHECK-NEXT: ptr %S
; CHECK-NEXT: Grouped accesses:
; CHECK-NEXT: Group [[GRP1]]:
-; CHECK-NEXT: (Low: ((4 * %y) + %P) High: ((4 * %y) + %P))
+; CHECK-NEXT: (Low: ((4 * %y) + %P) High: -1)
; CHECK-NEXT: Member: {((4 * %y) + %P),+,4}<%loop>
; CHECK-NEXT: Group [[GRP2]]:
; CHECK-NEXT: (Low: %S High: (4 + %S))
@@ -44,7 +43,6 @@ exit:
ret void
}
-; FIXME: Start > End for access group with AddRec.
define void @runtime_check_with_symbolic_max_btc_neg_2(ptr %P, ptr %S, i32 %x, i32 %y) {
; CHECK-LABEL: 'runtime_check_with_symbolic_max_btc_neg_2'
; CHECK-NEXT: loop:
@@ -58,7 +56,7 @@ define void @runtime_check_with_symbolic_max_btc_neg_2(ptr %P, ptr %S, i32 %x, i
; CHECK-NEXT: ptr %S
; CHECK-NEXT: Grouped accesses:
; CHECK-NEXT: Group [[GRP3]]:
-; CHECK-NEXT: (Low: ((4 * %y) + %P) High: (-4 + (4 * %y) + %P))
+; CHECK-NEXT: (Low: ((4 * %y) + %P) High: -1)
; CHECK-NEXT: Member: {((4 * %y) + %P),+,4}<%loop>
; CHECK-NEXT: Group [[GRP4]]:
; CHECK-NEXT: (Low: %S High: (4 + %S))
Force-pushed from b985f31 to 19fe791
I am not sure if there's a better way to check whether evaluateAtIteration may wrap. It also currently always passes both BTC and SymbolicMaxBTC, as there is one caller where PSE isn't directly available. This could probably also be improved once we agree on how to best prevent wrapping.
✅ With the latest revision this PR passed the C/C++ code formatter.
target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
; Note: The datalayout for the test specifies a 32 bit index type.
Alive proofs for the test cases showing the last accessed address doesn't have UB (@src1/@tgt1) and has UB (@src2/@tgt2): https://alive2.llvm.org/ce/z/EJVZep
Just a side-note, but the tests might be a little easier to understand if we use an 8-bit index type.
Very confused about the code, although the tests seem to check out. BackedgeTakenInfo has an IsComplete flag indicating whether SCEVCouldNotCompute will be returned.
llvm/lib/Analysis/Loads.cpp (Outdated)

@@ -319,11 +319,14 @@ bool llvm::isDereferenceableAndAlignedInLoop(
   const SCEV *MaxBECount =
       Predicates ? SE.getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
                  : SE.getConstantMaxBackedgeTakenCount(L);
   const SCEV *SymbolicMaxBECount =
       Predicates ? SE.getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
s/Constant/Symbolic/? Perhaps pass ExitKind to getPredicatedBackedgeTakenCount?
Yep this should be constant, will fix, thanks
But this isn't the symbolic maximum, right? The documentation says:

    /// A constant which provides an upper bound on the exact trip count.
    ConstantMaximum,
    /// An expression which provides an upper bound on the exact trip count.
    SymbolicMaximum,

Surely, it should be called ConstMaxBECount?
Sorry, I got confused in the previous review. This evaluateAtIteration wrapping is troubling me, and I'm not sure how it wraps: perhaps @nikic can chime in?
  if (SE->isLoopInvariant(PtrExpr, Lp)) {
    ScStart = ScEnd = PtrExpr;
  } else if (auto *AR = dyn_cast<SCEVAddRecExpr>(PtrExpr)) {
    ScStart = AR->getStart();
    ScEnd = AR->evaluateAtIteration(MaxBECount, *SE);
    if (!isa<SCEVCouldNotCompute>(BTC))
Not sure why we are passing the exact BTC, and handling the case where it is a could-not-compute. Why not just pass the symbolic max as before, and have the logic below?
If we can compute the backedge-taken count, we are guaranteed to execute exactly that number of iterations.

The symbolic max backedge-taken count is an upper bound, and the loop may exit at any earlier iteration (e.g. because it has an uncountable exit).

As per the comment, a computable BTC means we can rely on the fact that the pointers cannot wrap in any iteration. If we instead only have the symbolic max BTC, we may execute fewer iterations than the max, and then only those iterations are guaranteed to not wrap in general, so evaluating at the symbolic max may wrap.

One case to consider is when the symbolic max BTC is a SCEVUnknown: we will form a SCEVMulExpr for which we cannot determine whether it wraps (vs. the case where the symbolic max BTC is a constant).
Thanks for the explanation. My confusion is the following: if we have a computable BTC, isn't Exact = SymbolicMax? If we don't have a computable BTC, Exact = SCEVCouldNotCompute and SymbolicMax could be a SCEVConstant, general SCEV expression, SCEVUnknown, or SCEVCouldNotCompute, in the worst case. If my reasoning is correct, there is no additional information in Exact over the SymbolicMax, and we shouldn't have to pass Exact. In the test cases you have added, isn't SymbolicMax a SCEVConstant = INT_MAX? What does evaluating an AddRec at the INT_MAX iteration wrap to? Not -(EltSize + 1), or evaluating the AddRec at INT_MIN? Perhaps worth adding some SCEV tests for this evaluation, as a separate patch that we can verify?
When SymbolicMax is a SCEVUnknown, it means that the iteration is bounded by some function argument IR value, right? In this case, Exact will also be the same SCEVUnknown, and if we pass INT_MAX when calling the function, the evaluation will wrap, and this is UB anyway?
What happens when SymbolicMax is a SCEVCouldNotCompute? I think this will result in a crash with the current code.
Okay, just thinking out loud here: for simplicity, let AR = {0,+,1} and let SymbolicMax BTC = INT_MAX. Then, we compute AddExpr(0, MulExpr(1, INT_MAX)). I don't think this overflows. Now, let AR = {0,+,2}. Then, we compute AddExpr(0, MulExpr(2, BinomialCoefficient(INT_MAX, 2))), where the binomial coefficient evaluates to INT_MAX * (INT_MAX - 1) / 2. Naively doing this would overflow even for BTC equal to sqrt(INT_MAX), but it looks like BinomialCoefficient is written carefully, although the final result is truncated (?). In conclusion, it looks like the problem is that evaluateAtIteration does not wrap, but rather truncates the result?
Thanks for digging into this! One clarification is that we evaluate at BTC = UNSIGNED_MAX. So {0,+,1} won't wrap, but adding 1 will (getStartAndEndForAccess will compute the first address after the last access).

When we have strides larger than 1, the last accessed address will be something like %start + stride * UNSIGNED_MAX, which should wrap to something like %start - %stride. I am not entirely sure if there may be other wrapping issues with how evaluateAtIteration internally computes the result, but the original end point computed for runtime_checks_with_symbolic_max_btc_neg_1 should illustrate that: start == end due to adding %stride to the result of evaluateAtIteration.
  const SCEV *SymbolicMaxBTC = PSE.getSymbolicMaxBackedgeTakenCount();
  const SCEV *BTC = PSE.getBackedgeTakenCount();
  const auto &[ScStart, ScEnd] =
      getStartAndEndForAccess(Lp, PtrExpr, AccessTy, BTC, SymbolicMaxBTC,
                              PSE.getSE(), &DC.getPointerBounds());
Are we changing this because the exact BTC gives better results in some cases?
It's changed to differentiate the cases where we can and cannot compute the BTC exactly (there may not be a computable BTC for loops with early exits)
      ScEnd = AR->evaluateAtIteration(SymbolicMaxBTC, *SE);
      if (!SE->isKnownNonNegative(SE->getMinusSCEV(ScEnd, ScStart)))
        ScEnd = SE->getNegativeSCEV(
            SE->getAddExpr(EltSizeSCEV, SE->getOne(EltSizeSCEV->getType())));
Not sure how evaluateAtIteration overflows:

    Result = SE.getAddExpr(Result, SE.getMulExpr(Operands[i], Coeff));
llvm/lib/Analysis/Loads.cpp (Outdated)

@@ -319,11 +319,14 @@ bool llvm::isDereferenceableAndAlignedInLoop(
   const SCEV *MaxBECount =
       Predicates ? SE.getPredicatedConstantMaxBackedgeTakenCount(L, *Predicates)
                  : SE.getConstantMaxBackedgeTakenCount(L);
   const SCEV *SymbolicMaxBECount =
This value is identical to MaxBECount, so why not just pass MaxBECount in as both arguments to getStartAndEndForAccess?
Updated the naming to clarify the names, thanks
Not sure I understand the problem: the Loads caller has Predicates which we can pass, making it equivalent to a PSE call?
Force-pushed from d63d103 to a5b5a13
      // TODO: Use additional information to determine no-wrap including
      // size/dereferencability info from the accessed object.
      ScEnd = AR->evaluateAtIteration(MaxBTC, *SE);
      if (!SE->isKnownNonNegative(SE->getMinusSCEV(ScEnd, ScStart)))
Why is it sufficient to check that the difference is non-negative? Can't it happen that the addrec wraps but still ends up at a value > ScStart?
Yes, I also added a test to that effect now; this was an initial attempt to avoid regressions, which turned out to be a bit tricky to fix.

For now, I tried to check the object size, if known, to see if the maximum value of the add-rec will be inside the object in evaluateAddRecAtMaxBTCWillNotWrap.
Rebased; the latest version also fixes incorrectly determining that accesses in loops are dereferenceable (see dereferenceable-info-from-assumption-variable-size.ll).
@@ -1480,57 +1480,21 @@ define i64 @same_exit_block_pre_inc_use1_reverse() {
 ; CHECK-NEXT:    [[P2:%.*]] = alloca [1024 x i8], align 1
 ; CHECK-NEXT:    call void @init_mem(ptr [[P1]], i64 1024)
 ; CHECK-NEXT:    call void @init_mem(ptr [[P2]], i64 1024)
 ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
This looks like a regression to me. Are you suggesting that we simply cannot ever check for dereferenceability in reverse loops for some fundamental reason or that the test itself was invalid? If it's the latter I'm happy to rewrite the test. :) It would be a shame to lose this functionality.
The functionality for supporting testing whether loads could be dereferenced in reverse loops was added in #96752 specifically to support such cases.
Yep the original version wasn't handling this, but should be fixed in the latest version.
Extend test coverage for #128061.
Use dereferenceable attribute instead of assumption to make the tests independent of #128061.
Force-pushed from a5b5a13 to 86016b2
Force-pushed from 86016b2 to 6628dc2
ping :)
Hi @fhahn, thanks for this patch! I've left some comments so far - I just have to finish reviewing evaluateAddRecAtMaxBTCWillNotWrap.
 ; CHECK-NEXT:      Member: {%B,+,4}<nuw><%loop.header>
 ; CHECK-NEXT:    Group [[GRP4]]:
-; CHECK-NEXT:      (Low: %A High: (2000 + %A))
+; CHECK-NEXT:      (Low: %A High: inttoptr (i64 -1 to ptr))
I'm probably missing something here, but this seems too conservative, right? The loop absolutely cannot execute more than 500 times even if we take an early exit. Even before, the High of 2000 + %A was a conservative worst case based on not taking an early exit. Is the idea to fix existing bugs for now in this patch and possibly improve upon this later? Or have I just missed some fundamental reasoning?
Yep, we have a bound on the number of iterations, and we are guaranteed not to exceed it.

One problematic input could be where only the first 10 elements of A and B are dereferenceable, and the early exit leaves the loop at iteration 9. If we expand the bounds with B + 2000, this may wrap in that case.
Ignoring the fact that the GEPs in the loop explicitly say there is no wrapping (due to inbounds), it sounds like this patch is saying that for early exit loops 2000+%A could wrap (hence the need to clamp the value), but for normal loops 2000+%A cannot wrap (hence we don't need to clamp). I'm just struggling to understand the reasoning behind this. Having said that, I'm happy to accept this limitation for now if it makes life easier and helps to solve the immediate issue.

I think it's worth adding a TODO that we can refine this in future. I think anyone looking at this test might be confused about the discrepancy between normal and early exit loops that have the same mathematical maximum upper bound for memory accesses. Also, after this patch if we have an early exit loop that requires runtime memory checks then there is a very high chance of failing those checks due to the increase in upper bound to UINT64_MAX, so it's questionable whether there is even any point vectorising loops with runtime memory checks?
Ignoring the fact that the GEPs in the loop explicitly say there is no wrapping (due to inbounds), it sounds like this patch is saying that for early exit loops 2000+%A could wrap (hence the need to clamp the value), but for normal loops 2000+%A cannot wrap (hence we don't need to clamp). I'm just struggling to understand the reasoning behind this. Having said that, I'm happy to accept this limitation for now if it makes life easier and helps to solve the immediate issue.
Technically the GEPs say that the result is poison if it is not inbounds. Dereferencing the poison GEP would trigger UB.
Say 1000+%A wraps. When there's no early exit, 2000+%A would still wrap, but the original loop is guaranteed to trigger UB, because it must access the memory range %A..2000+%A.
If there is an early exit, the original loop may always exit before accessing 1000+%A, so it would not trigger UB, but we would expand the bound to 2000+%A, which would have wrapped, and the runtime check would be incorrect.
Hope that makes sense.
I think I understand. It sounds like you're saying that without the early exit we don't really care if the runtime checks are nonsense or not because the entire loop is UB anyway? Whereas for early exit loops the loop may or may not be UB and so we do care about getting the right runtime checks.
@@ -17,7 +17,7 @@ define void @runtime_checks_with_symbolic_max_btc_neg_1(ptr %P, ptr %S, i32 %x,
 ; CHECK-NEXT: ptr %S
 ; CHECK-NEXT: Grouped accesses:
 ; CHECK-NEXT: Group [[GRP1]]:
-; CHECK-NEXT: (Low: ((4 * %y) + %P) High: ((4 * %y) + %P))
+; CHECK-NEXT: (Low: ((4 * %y) + %P) High: inttoptr (i32 -1 to ptr))
I think you can remove the FIXME comment about the function now?
Done thanks
@@ -58,7 +58,7 @@ define void @runtime_check_with_symbolic_max_btc_neg_2(ptr %P, ptr %S, i32 %x, i
 ; CHECK-NEXT: ptr %S
 ; CHECK-NEXT: Grouped accesses:
 ; CHECK-NEXT: Group [[GRP3]]:
-; CHECK-NEXT: (Low: ((4 * %y) + %P) High: (-4 + (4 * %y) + %P))
+; CHECK-NEXT: (Low: ((4 * %y) + %P) High: inttoptr (i32 -1 to ptr))
Remove the FIXME comment?
Done thanks
Looks like these changes can be reverted?
Yep, undone, thanks
// than the start of the AddRec due to wrapping (for example consider
// MaxBTC = -2). If that's the case, set ScEnd to -(EltSize + 1). ScEnd
// will get incremented by EltSize before returning, so this effectively
// sets ScEnd to unsigned max. Note that LAA separately checks that
Perhaps clearer to write "set ScEnd to the maximum unsigned value for the type"? At the moment it could imply the max value for the C type unsigned, or perhaps I'm just being too pedantic?!
Updated, thanks. This is clearer, especially now that the code generates inttoptr -1.
// will get incremented by EltSize before returning, so this effectively
// sets ScEnd to unsigned max. Note that LAA separately checks that
// accesses cannot wrap, so unsigned max represents an upper bound.
ScEnd = SE->getAddExpr(
I realise you're probably trying to save on indentation here, but is it worth moving this into an else case below to avoid unnecessary computation? i.e.
if (evaluateAddRecAtMaxBTCWillNotWrap(AR, MaxBTC, *SE, DL))
ScEnd = AR->evaluateAtIteration(MaxBTC, *SE);
else {
ScEnd = SE->getAddExpr(...
}
Updated, thanks
// accesses cannot wrap, so unsigned max represents an upper bound.
ScEnd = SE->getAddExpr(
    SE->getNegativeSCEV(EltSizeSCEV),
    SE->getSCEV(ConstantExpr::getIntToPtr(
Why is it using getIntToPtr here - doesn't it just need to be the same type as EltSizeSCEV? It looks like we're creating an add expression where the first operand is an integer and the second is a pointer.
ScEnd needs to be a pointer, and we cannot create a pointer constant for -1. This is needed for ScEnd to have the expected type.
const SCEV *MaxBTC,
ScalarEvolution &SE,
const DataLayout &DL) {
auto *PointerBase = SE.getPointerBase(AR->getStart());
This assumes AR is for pointers, but the function name indicates the add rec could be anything. Is it worth making that clear in the function name? I'm not sure what would happen if we call SE.getPointerBase for something that isn't a pointer?
Yep, it must be a pointer AddRec, and that's guaranteed by the current caller. Updated to evaluatePtrAddRecAtMaxBTCWillNotWrap.
const SCEV *Step = AR->getStepRecurrence(SE);
Type *WiderTy = SE.getWiderType(MaxBTC->getType(), Step->getType());
Step = SE.getNoopOrSignExtend(Step, WiderTy);
MaxBTC = SE.getNoopOrSignExtend(MaxBTC, WiderTy);
Why is this a sign-extend? Is there something that prevents the max BTC in the original IR from being unsigned? i.e.
%iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
...
possible jump to early exit
latch:
%iv.next = add nuw nsw i32 %iv, 1
%cond = icmp ult i32 %iv, %n
Or do we introduce SCEV checks before the loop to ensure that %n never has the sign-bit set?
The sign bit may be set, but we are using the BTC as unsigned value so sign-extending should be conservatively correct.
Step = SE.getNoopOrSignExtend(Step, WiderTy);
MaxBTC = SE.getNoopOrSignExtend(MaxBTC, WiderTy);
if (SE.isKnownPositive(Step)) {
  // For positive steps, check if (AR->getStart() - StartPtr) + MaxBTC <=
I checked out this patch and dumped out AR to see what units it uses - it seems they are in units of bytes.
If AR->getStart() and StartPtr are pointers then isn't the difference also in units of bytes? So shouldn't we actually be doing (AR->getStart() - StartPtr) + (MaxBTC * Step) <= DerefBytes?
I'm not sure we can use an expression such as ((AR->getStart() - StartPtr) / Step) + MaxBTC <= DerefBytes because the start offset may not be a multiple of Step.
I assume the reason you're not including the element size of the loaded value in the calculation is because you only care about the actual pointer wrapping, not whether ptr + ElementSize - 1 could wrap?
The original check was a conservative over-approximation that tried to avoid introducing multiplies by step to avoid potential overflows in intermediate computations. But after looking again at the code for the negative step case, I think we really need to compute MaxBTC * Step for that case anyways, and correctly handle when it could overflow, so I reworked the code here to handle overflow at each computation consistently.
Below is the (now obsolete) comment I had already written before the rework.
So shouldn't we actually be doing
(AR->getStart() - StartPtr) + (MaxBTC * Step) <= DerefBytes?
Yep, the current checks should be a conservative over-approximation that drops the division, which would complicate things since the offset may not be a multiple of the step; in that case the division would drop the remainder, which could lead to incorrect results due to undercounting the size. I updated the comment.
The reason for not checking the form above directly is to avoid multiplying by step, which may overflow itself.
if (SE.isKnownNegative(Step)) {
  // For negative steps, check using StartOffset == AR->getStart() - StartPtr:
  // * StartOffset >= MaxBTC * Step
  // * AND StartOffset <= DerefBytes / Step
This seems inconsistent - the first condition treats StartOffset in units of bytes (i.e. >= MaxBTC * Step), whereas the second condition treats StartOffset in units of steps. Given that the StartOffset is not guaranteed to be a multiple of Step I think this should be:
// * StartOffset >= MaxBTC * abs(Step)
// * AND StartOffset <= DerefBytes
This should be updated in the latest change, sharing most of the computation with the positive step case.
auto *StartOffset = SE.getNoopOrSignExtend(
    SE.getMinusSCEV(AR->getStart(), StartPtr), WiderTy);
return SE.isKnownPredicate(
    CmpInst::ICMP_UGE, StartOffset,
Is there anything that prevents StartOffset from being negative? If it could be negative then we have to use ICMP_SGE here instead.
It could be negative; updated, thanks.
Hi @fhahn, do you have any plans to work on this PR in the near future? I'm just asking because you suggested I hold off #133099 (enabling auto-vectorisation of loops with uncountable early exits by default) until this PR has landed. It would be nice to get the uncountable early exit work more widely tested by being enabled, since at the moment we're relying solely upon the LLVM IR tests.
Hi @fhahn, do you have any plans to work on this PR in the near future? I'm just asking because you suggested I hold off #133099 (enabling auto-vectorisation of loops with uncountable early exits by default) until this PR has landed. It would be nice to get the uncountable early exit work more widely tested by being enabled, since at the moment we're relying solely upon the LLVM IR tests.
Ah yes, thanks for the reminder, this dropped off my radar. Will try to update it this week.
Update SimplifyICmpOperands to only try subtracting 1 from RHS first, if RHS is an op we can fold the subtract directly into. Otherwise try adding to LHS first, as we can preserve NUW flags. This improves results in a few cases, including the modified test case from berkeley-abc and new code to be added in llvm#128061. Note that there are more cases where the results can be improved by better ordering here which I'll try to investigate as follow-up.
Force-pushed from 6628dc2 to 0989341.
Thanks for the latest updates! I just have a few more comments...
 ; CHECK-NEXT: Member: {%B,+,4}<nuw><%loop.header>
 ; CHECK-NEXT: Group [[GRP4]]:
-; CHECK-NEXT: (Low: %A High: (2000 + %A))
+; CHECK-NEXT: (Low: %A High: inttoptr (i64 -1 to ptr))
MaxBTC = SE.getNoopOrZeroExtend(MaxBTC, WiderTy);

// For the computations below, make sure they don't unsigned wrap.
if (!SE.isKnownPredicate(CmpInst::ICMP_UGE, AR->getStart(), StartPtr))
Is there a hard theoretical limitation that prevents AR from being something like (%p - 16)? Or does it just make the later calculations simpler? If it's the latter (sounds like it), then can you add a TODO that we can improve this in future?
This currently makes sure that AR->getStart() - StartPtr cannot wrap. We cannot check if the result of getMinusSCEV has NUW, as it will form a SCEVAddExpr with a negative operand, which will generally not be NUW.
// For the computations below, make sure they don't unsigned wrap.
if (!SE.isKnownPredicate(CmpInst::ICMP_UGE, AR->getStart(), StartPtr))
  return false;
const SCEV *StartOffset = SE.getNoopOrSignExtend(
Given the above check guarantees SE.getMinusSCEV will return a positive value, I think you can use SE.getNoopOrZeroExtend.
Done thanks.
/// return nullptr.
static const SCEV *mulSCEVOverflow(const SCEV *A, const SCEV *B,
                                   ScalarEvolution &SE) {
  if (!SE.willNotOverflow(Instruction::Mul, false, A, B))
Perhaps worth adding something in the comments for mulSCEVOverflow and addSCEVOverflow that A and B need to have the same type in order to do the overflow check?
Added a comment, thanks. They must have the same type for A * B to be valid.
  return false;

const SCEV *OffsetLastAccessedByte = addSCEVOverflow(
    OffsetAtLastIter, SE.getNoopOrZeroExtend(EltSize, WiderTy), SE);
I guess this isn't really the last accessed byte, which would really be EltSize - 1. Might be clearer to simply call it EndOffset?
I realise it makes the subsequent calculations simpler, but it does artificially increase the chance of overflow because in reality the pointer for the last accessed byte is (AR->getStart() + (MaxBTC * Step) + EltSize - 1). I'm fine with that, but maybe worth a comment?
Updated the names, thanks.
I realise it makes the subsequent calculations simpler, but it does artificially increase the chance of overflow because in reality the pointer for the last accessed byte is (AR->getStart() + (MaxBTC * Step) + EltSize - 1). I'm fine with that, but maybe worth a comment?
We are comparing the result <= DerefBytes (i.e. the end of the accessed range is at most DerefBytes), so I think it shouldn't pessimize the results. If we subtract 1, we would check < DerefBytes (i.e. the last accessed byte must be inside the dereferenceable bytes).
// For positive steps, check if
// (AR->getStart() - StartPtr) + (MaxBTC * Step) + EltSize <= DerefBytes,
// while making sure none of the computations unsigned wrap themselves.
const SCEV *LastAccessedByte =
Again, this isn't the last accessed byte - it's the last accessed byte plus one.
Updated to EndByte, thanks.
    SE.isKnownPredicate(CmpInst::ICMP_ULE, StartOffset,
                        SE.getConstant(WiderTy, DerefBytes));
}
return false;
Given that we can't do anything useful if Step isn't known to be positive or negative, it's probably worth checking this much earlier on, i.e. something like:
bool IsKnownPositive = false, IsKnownNegative = false;
if (SE.isKnownPositive(Step))
IsKnownPositive = true;
else if (SE.isKnownNegative(Step))
IsKnownNegative = true;
else
return false;
then reusing the booleans here. What do you think?
Updated to store IsKnownNonNegative in a variable, and exit early if it is neither known non-negative nor negative. I changed it to check isKnownNonNegative instead of isKnownPositive, as non-negative is the important property.
else {
  // Evaluating AR at MaxBTC may wrap and create an expression that is less
  // than the start of the AddRec due to wrapping (for example consider
  // MaxBTC = -2). If that's the case, set ScEnd to -(EltSize + 1). ScEnd
I guess I still don't understand why exact BTC = -2 in the if case above isn't a problem too, although I realise you've tried to explain this to me once already and I do believe that it's an issue. :)
It still sounds like there is a deficiency elsewhere in LAA that we're trying to workaround here, but I can live with that for now!
If there is a computable BTC, we must take the backedge exactly BTC times; if the pointer expression would wrap, then the original loop is guaranteed to have UB.
If we only have a symbolic max BTC, we take the backedge up to BTC times, but the original loop could exit before we wrap and trigger UB. If we would generate start and end pointers based on the symbolic max before the loop, those may wrap and cause incorrect runtime check results.
With the computable BTC, the start/end pointers can only wrap if the loop has UB.
@@ -10892,7 +10892,8 @@ bool ScalarEvolution::SimplifyICmpOperands(CmpPredicate &Pred, const SCEV *&LHS,
   }
   break;
 case ICmpInst::ICMP_UGE:
-  if (!getUnsignedRangeMin(RHS).isMinValue()) {
+  // If we can fold -1 into RHS, try that first.
Is this worth a PR by itself if there are existing tests for this? It looks like llvm/test/Transforms/IndVarSimplify/simplify-icmp-operands-order.ll tests this? Having said that, I'm also happy if you want to keep it in this patch. I was just worried that for some reason there might be a regression post-commit so we'd also have to revert these changes too.
…#144404) Update SimplifyICmpOperands to only try subtracting 1 from RHS first, if RHS is an op we can fold the subtract directly into. Otherwise try adding to LHS first, as we can preserve NUW flags. This improves results in a few cases, including the modified test case from berkeley-abc and new code to be added in #128061. Note that there are more cases where the results can be improved by better ordering here which I'll try to investigate as follow-up. PR: #144404
Evaluating AR at the symbolic max BTC may wrap and create an expression that is less than the start of the AddRec due to wrapping (for example consider MaxBTC = -2). If that's the case, set ScEnd to -(EltSize + 1). ScEnd will get incremented by EltSize before returning, so this effectively sets ScEnd to unsigned max. Note that LAA separately checks that accesses cannot wrap, so unsigned max represents an upper bound.
Force-pushed from 0989341 to fc91973.
LGTM.
@@ -130,10 +130,10 @@ define i32 @all_exits_dominate_latch_countable_exits_at_most_1000_iterations_kno
 ; CHECK-NEXT: %gep.A = getelementptr inbounds i32, ptr %A, i64 %iv
 ; CHECK-NEXT: Grouped accesses:
 ; CHECK-NEXT: Group GRP0:
-; CHECK-NEXT: (Low: %B High: (4004 + %B))
+; CHECK-NEXT: (Low: %B High: inttoptr (i64 -1 to ptr))
This needs a TODO because we shouldn't be doing this for dereferenceable pointers, right? Surely if it's guaranteed to be dereferenceable that implies it should not wrap? For example, I'm struggling to see how a C++ object passed by reference to a function could be allocated across a wrapped address space and be legal. I would expect any attempt to actually use the object triggers undefined behaviour in the C++ specification. I realise there's more to life than C++ - this is just one example of course.
Also, if I've understood correctly it will significantly impact @huntergr-arm's work to enable vectorisation of early exit loops with loads and stores from dereferenceable memory.
Again, I'm happy to accept the patch as is, just saying that we should improve this in future if we ever want to make progress with early exit vectorisation.
Evaluating AR at the symbolic max BTC may wrap and create an expression that is less than the start of the AddRec due to wrapping (for example consider MaxBTC = -2).
If that's the case, set ScEnd to -(EltSize + 1). ScEnd will get incremented by EltSize before returning, so this effectively sets ScEnd to unsigned max. Note that LAA separately checks that accesses cannot wrap (52ded67, #127543), so unsigned max represents an upper bound.
When there is a computable backedge-taken count, we are guaranteed to execute that number of iterations, and if any pointer would wrap it would be UB (or the access will never be executed, so cannot alias). This PR includes new tests from the previous discussion that show a case where we wrap with a BTC, but it is UB due to the pointer after the object wrapping (in evaluate-at-backedge-taken-count-wrapping.ll).
Note that an earlier version of the patch was shared as #106530, but I accidentally deleted the branch and now I cannot figure out how to reopen that PR.