[ARM] Reduce loop unroll when low overhead branching is available #120065


Merged
merged 1 commit into llvm:main from the arm_no_deep_unroll_with_lob branch on Dec 18, 2024

Conversation

VladiKrapp-Arm
Contributor

@VladiKrapp-Arm VladiKrapp-Arm commented Dec 16, 2024

For processors with low overhead branching (LOB), runtime unrolling the innermost loop is often detrimental to performance. In these cases the loop remainder gets unrolled into a series of compare-and-jump blocks, which in deeply nested loops get executed multiple times, negating the benefits of LOB.
This is particularly noticeable when the loop trip count of the innermost loop varies within the outer loop, such as in the case of triangular matrix decompositions.

In these cases we prefer not to unroll the innermost loop, with the intention that it be executed as a low overhead loop.
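As a hypothetical illustration (not part of the PR), a triangular loop nest of the kind described — where the inner trip count depends on the outer induction variable, so a runtime-unrolled remainder would be re-executed on every outer iteration — might look like this. The function name and matrix layout are assumptions for the sketch:

```c
#include <stddef.h>

/* Hypothetical example: scale the lower-triangular part of an n x n
 * row-major matrix in place. The inner loop's trip count (i + 1)
 * changes on every outer iteration, so any unrolled remainder of the
 * inner loop runs once per outer iteration. */
void scale_lower_triangular(float *A, size_t n, float s) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) { /* trip count varies with i */
            A[i * n + j] *= s;
        }
    }
}
```

With LOB available, leaving the inner loop as-is lets it run as a hardware low overhead loop instead of paying the remainder's compare-and-jump blocks n times.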

@llvmbot
Member

llvmbot commented Dec 16, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-arm

Author: Vladi Krapp (VladiKrapp-Arm)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/120065.diff

2 Files Affected:

  • (modified) llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp (+13-1)
  • (modified) llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll (+18-9)
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index 0e29648a7a284f..d336a69718cd36 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -2592,11 +2592,23 @@ void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
       return;
   }
 
+  bool Runtime = true;
+  if (ST->hasLOB()) {
+    if (SE.hasLoopInvariantBackedgeTakenCount(L)) {
+      const auto *BETC = SE.getBackedgeTakenCount(L);
+      auto *Outer = L->getOutermostLoop();
+      if ((L != Outer && Outer != L->getParentLoop()) ||
+          (L != Outer && BETC && !SE.isLoopInvariant(BETC, Outer))) {
+        Runtime = false;
+      }
+    }
+  }
+
   LLVM_DEBUG(dbgs() << "Cost of loop: " << Cost << "\n");
   LLVM_DEBUG(dbgs() << "Default Runtime Unroll Count: " << UnrollCount << "\n");
 
   UP.Partial = true;
-  UP.Runtime = true;
+  UP.Runtime = Runtime;
   UP.UnrollRemainder = true;
   UP.DefaultUnrollRuntimeCount = UnrollCount;
   UP.UnrollAndJam = true;
diff --git a/llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll b/llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll
index b155f5d31045f9..111bc96b28806a 100644
--- a/llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll
+++ b/llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll
@@ -1,17 +1,23 @@
+; RUN: opt -mcpu=cortex-m7 -mtriple=thumbv8.1m.main -passes=loop-unroll -S  %s -o - | FileCheck %s --check-prefix=NLOB
 ; RUN: opt -mcpu=cortex-m55 -mtriple=thumbv8.1m.main -passes=loop-unroll -S  %s -o - | FileCheck %s --check-prefix=LOB
 
 ; This test checks behaviour of loop unrolling on processors with low overhead branching available 
 
-; LOB-CHECK-LABEL: for.body{{.*}}.prol
-; LOB-COUNT-1:     fmul fast float 
-; LOB-CHECK-LABEL: for.body{{.*}}.prol.1
-; LOB-COUNT-1:     fmul fast float 
-; LOB-CHECK-LABEL: for.body{{.*}}.prol.2
-; LOB-COUNT-1:     fmul fast float 
-; LOB-CHECK-LABEL: for.body{{.*}}
-; LOB-COUNT-4:     fmul fast float 
+; NLOB-LABEL: for.body{{.*}}.prol:
+; NLOB-COUNT-1:     fmul fast float 
+; NLOB-LABEL: for.body{{.*}}.prol.1:
+; NLOB-COUNT-1:     fmul fast float 
+; NLOB-LABEL: for.body{{.*}}.prol.2:
+; NLOB-COUNT-1:     fmul fast float 
+; NLOB-LABEL: for.body{{.*}}:
+; NLOB-COUNT-4:     fmul fast float 
+; NLOB-NOT:     fmul fast float 
+
+; LOB-LABEL: for.body{{.*}}:
+; LOB:     fmul fast float 
 ; LOB-NOT:     fmul fast float 
 
+
 ; Function Attrs: nofree norecurse nosync nounwind memory(argmem: readwrite)
 define dso_local void @test(i32 noundef %n, ptr nocapture noundef %pA) local_unnamed_addr #0 {
 entry:
@@ -20,7 +26,7 @@ entry:
 
 for.cond.loopexit:                                ; preds = %for.cond6.for.cond.cleanup8_crit_edge.us, %for.body
   %exitcond49.not = icmp eq i32 %add, %n
-  br i1 %exitcond49.not, label %for.cond.cleanup, label %for.body
+  br i1 %exitcond49.not, label %for.cond.cleanup, label %for.body, !llvm.loop !0
 
 for.cond.cleanup:                                 ; preds = %for.cond.loopexit, %entry
   ret void
@@ -61,3 +67,6 @@ for.cond6.for.cond.cleanup8_crit_edge.us:         ; preds = %for.body9.us
   br i1 %exitcond48.not, label %for.cond.loopexit, label %for.cond6.preheader.us
 }
 
+!0 = distinct !{!0, !1, !2}
+!1 = !{!"llvm.loop.mustprogress"}
+!2 = !{!"llvm.loop.unroll.disable"}
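For context on the test's `!llvm.loop !0` metadata: the outer loop is tagged with `llvm.loop.unroll.disable` so only the inner loop's unrolling decision is exercised. In C/C++ source, Clang emits this metadata for `#pragma clang loop unroll(disable)`; a minimal sketch (the function name is an assumption, and other compilers simply ignore the unknown pragma):

```c
/* Sketch: Clang lowers this pragma to llvm.loop.unroll.disable
 * metadata on the loop's backedge branch. */
void scale_array(float *a, int n) {
#pragma clang loop unroll(disable)
    for (int i = 0; i < n; ++i) {
        a[i] *= 2.0f; /* loop body is kept as a single rolled loop */
    }
}
```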

@davemgreen davemgreen requested a review from stuij December 16, 2024 18:28
@VladiKrapp-Arm VladiKrapp-Arm force-pushed the arm_no_deep_unroll_with_lob branch from 72d5967 to eaaa7fd Compare December 17, 2024 09:27
Collaborator

@davemgreen davemgreen left a comment


Thanks. LGTM

@stuij stuij merged commit f8d2704 into llvm:main Dec 18, 2024
8 checks passed