[llvm] [lsr][term-fold] Restrict expansion budget for profiled loops (PR #74747)

Wed Jan 31 12:58:35 PST 2024

https://github.com/preames updated https://github.com/llvm/llvm-project/pull/74747

>From 4d57ba17d2a9875284a77078b1a630f9d262864e Mon Sep 17 00:00:00 2001
From: Philip Reames <preames at rivosinc.com>
Date: Thu, 7 Dec 2023 10:40:46 -0800
Subject: [PATCH 1/2] [lsr][term-fold] Restrict expansion budget, particularly
 for profiled loops

This is a follow up to an item I noted in my submission comment for
e947f95.  I don't have a real world example where this is triggering
unprofitably, but avoiding the transform when we estimate the loop
to be short running from profiling seems quite reasonable.

A couple of observations:
* Because we already checked for affine addrecs, the worst expansion we're
  going to see is the unknown trip count * unknown step case (both are
  loop invariant).  I picked a fallback cost (2 * SCEV's default expansion
  budget of 4) which was high enough to allow this.  These cases will
  end up being somewhat rare anyways as finding one where we can prove
  legality takes some work.
* Short running loops with constant trip counts produce either constant
  expressions (if the step is constant) or multiplications by small
  constant (if the step is unknown).  Both of these are cheap by any
  reasonable budget and thus not wroth checking for explicitly.
* Estimated trip counts come from profile data, loops without profile
  data will not have one.

Overall, this doesn't end up making much of a difference in practical
behavior of the transform.  Probably still worth landing, but I expect
this to be pretty low impact.

Possible future work (which I have no plans of doing at this time):
* Ppick the lowest cost alternate IV if we have multiple ones under the
  budget.
* Use SCEV to prove a constant upper bound on the trip count, and restrict
  the transform for unknown-but-small trip counts.
* I think we can now remove the affine restriction if we wanted.  Would
  need to think carefully about legality on this one.

I don't have examples where any of these would be profitable in practice,
but we'll see what comes up in the future.
---
 .../Transforms/Scalar/LoopStrengthReduce.cpp  | 22 +++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
index 7ebc5da8b25a5..6ab795f4e2cf9 100644
--- a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
@@ -6762,7 +6762,7 @@ static llvm::PHINode *GetInductionVariable(const Loop &L, ScalarEvolution &SE,
 
 static std::optional<std::tuple<PHINode *, PHINode *, const SCEV *, bool>>
 canFoldTermCondOfLoop(Loop *L, ScalarEvolution &SE, DominatorTree &DT,
-                      const LoopInfo &LI) {
+                      const LoopInfo &LI, const TargetTransformInfo &TTI) {
   if (!L->isInnermost()) {
     LLVM_DEBUG(dbgs() << "Cannot fold on non-innermost loop\n");
     return std::nullopt;
@@ -6813,6 +6813,15 @@ canFoldTermCondOfLoop(Loop *L, ScalarEvolution &SE, DominatorTree &DT,
   if (!isAlmostDeadIV(ToFold, LoopLatch, TermCond))
     return std::nullopt;
 
+  // Inserting instructions in the preheader has a runtime cost, scale
+  // the allowed cost with the loops trip count as best we can.
+  const unsigned ExpansionBudget = [&]() {
+    if (std::optional<unsigned> SmallTC = getLoopEstimatedTripCount(L))
+      return std::min(2*SCEVCheapExpansionBudget, *SmallTC);
+    // Unknown trip count, assume long running by default.
+    return 2*SCEVCheapExpansionBudget;
+  }();
+
   const SCEV *BECount = SE.getBackedgeTakenCount(L);
   const DataLayout &DL = L->getHeader()->getModule()->getDataLayout();
   SCEVExpander Expander(SE, DL, "lsr_fold_term_cond");
@@ -6820,6 +6829,7 @@ canFoldTermCondOfLoop(Loop *L, ScalarEvolution &SE, DominatorTree &DT,
   PHINode *ToHelpFold = nullptr;
   const SCEV *TermValueS = nullptr;
   bool MustDropPoison = false;
+  auto InsertPt = L->getLoopPreheader()->getTerminator();
   for (PHINode &PN : L->getHeader()->phis()) {
     if (ToFold == &PN)
       continue;
@@ -6861,6 +6871,14 @@ canFoldTermCondOfLoop(Loop *L, ScalarEvolution &SE, DominatorTree &DT,
       continue;
     }
 
+    if (Expander.isHighCostExpansion(TermValueSLocal, L, ExpansionBudget,
+                                     &TTI, InsertPt)) {
+      LLVM_DEBUG(
+          dbgs() << "Is too expensive to expand terminating value for phi node"
+                 << PN << "\n");
+      continue;
+    }
+
     // The candidate IV may have been otherwise dead and poison from the
     // very first iteration.  If we can't disprove that, we can't use the IV.
     if (!mustExecuteUBIfPoisonOnPathTo(&PN, LoopLatch->getTerminator(), &DT)) {
@@ -6986,7 +7004,7 @@ static bool ReduceLoopStrength(Loop *L, IVUsers &IU, ScalarEvolution &SE,
   }();
 
   if (EnableFormTerm) {
-    if (auto Opt = canFoldTermCondOfLoop(L, SE, DT, LI)) {
+    if (auto Opt = canFoldTermCondOfLoop(L, SE, DT, LI, TTI)) {
       auto [ToFold, ToHelpFold, TermValueS, MustDrop] = *Opt;
 
       Changed = true;

>From f63c971315dccc08771f134a1cc17f2e72425875 Mon Sep 17 00:00:00 2001
From: Philip Reames <preames at rivosinc.com>
Date: Wed, 31 Jan 2024 12:58:02 -0800
Subject: [PATCH 2/2] [LSR] Use simplest constant threshold for the moment

---
 .../Transforms/Scalar/LoopStrengthReduce.cpp  | 12 ++----------
 .../LoopStrengthReduce/lsr-term-fold.ll       | 19 ++++---------------
 2 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
index 6ab795f4e2cf9..831911ab21c22 100644
--- a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
@@ -6813,15 +6813,6 @@ canFoldTermCondOfLoop(Loop *L, ScalarEvolution &SE, DominatorTree &DT,
   if (!isAlmostDeadIV(ToFold, LoopLatch, TermCond))
     return std::nullopt;
 
-  // Inserting instructions in the preheader has a runtime cost, scale
-  // the allowed cost with the loops trip count as best we can.
-  const unsigned ExpansionBudget = [&]() {
-    if (std::optional<unsigned> SmallTC = getLoopEstimatedTripCount(L))
-      return std::min(2*SCEVCheapExpansionBudget, *SmallTC);
-    // Unknown trip count, assume long running by default.
-    return 2*SCEVCheapExpansionBudget;
-  }();
-
   const SCEV *BECount = SE.getBackedgeTakenCount(L);
   const DataLayout &DL = L->getHeader()->getModule()->getDataLayout();
   SCEVExpander Expander(SE, DL, "lsr_fold_term_cond");
@@ -6871,7 +6862,8 @@ canFoldTermCondOfLoop(Loop *L, ScalarEvolution &SE, DominatorTree &DT,
       continue;
     }
 
-    if (Expander.isHighCostExpansion(TermValueSLocal, L, ExpansionBudget,
+    if (Expander.isHighCostExpansion(TermValueSLocal, L,
+                                     2*SCEVCheapExpansionBudget,
                                      &TTI, InsertPt)) {
       LLVM_DEBUG(
           dbgs() << "Is too expensive to expand terminating value for phi node"
diff --git a/llvm/test/Transforms/LoopStrengthReduce/lsr-term-fold.ll b/llvm/test/Transforms/LoopStrengthReduce/lsr-term-fold.ll
index 9c0a5709273c1..c6ffca5f145e4 100644
--- a/llvm/test/Transforms/LoopStrengthReduce/lsr-term-fold.ll
+++ b/llvm/test/Transforms/LoopStrengthReduce/lsr-term-fold.ll
@@ -599,26 +599,15 @@ define void @expensive_expand_unknown_tc2(ptr %a, i32 %offset, i32 %n, i32 %step
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[OFFSET_NONZERO:%.*]] = or i32 [[OFFSET:%.*]], 1
 ; CHECK-NEXT:    [[UGLYGEP:%.*]] = getelementptr i8, ptr [[A:%.*]], i64 84
-; CHECK-NEXT:    [[SMAX:%.*]] = call i32 @llvm.smax.i32(i32 [[STEP:%.*]], i32 [[N:%.*]])
-; CHECK-NEXT:    [[TMP0:%.*]] = sub i32 [[SMAX]], [[STEP]]
-; CHECK-NEXT:    [[UMIN:%.*]] = call i32 @llvm.umin.i32(i32 [[TMP0]], i32 1)
-; CHECK-NEXT:    [[TMP1:%.*]] = sub i32 [[TMP0]], [[UMIN]]
-; CHECK-NEXT:    [[UMAX:%.*]] = call i32 @llvm.umax.i32(i32 [[STEP]], i32 1)
-; CHECK-NEXT:    [[TMP2:%.*]] = udiv i32 [[TMP1]], [[UMAX]]
-; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[UMIN]], [[TMP2]]
-; CHECK-NEXT:    [[TMP4:%.*]] = zext i32 [[TMP3]] to i64
-; CHECK-NEXT:    [[TMP5:%.*]] = add nuw nsw i64 [[TMP4]], 1
-; CHECK-NEXT:    [[TMP6:%.*]] = sext i32 [[OFFSET_NONZERO]] to i64
-; CHECK-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP5]], [[TMP6]]
-; CHECK-NEXT:    [[TMP8:%.*]] = add nsw i64 [[TMP7]], 84
-; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP8]]
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[LSR_IV1:%.*]] = phi ptr [ [[UGLYGEP2:%.*]], [[FOR_BODY]] ], [ [[UGLYGEP]], [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[LSR_IV:%.*]] = phi i32 [ [[LSR_IV_NEXT:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]
 ; CHECK-NEXT:    store i32 1, ptr [[LSR_IV1]], align 4
+; CHECK-NEXT:    [[LSR_IV_NEXT]] = add nsw i32 [[LSR_IV]], [[STEP:%.*]]
 ; CHECK-NEXT:    [[UGLYGEP2]] = getelementptr i8, ptr [[LSR_IV1]], i32 [[OFFSET_NONZERO]]
-; CHECK-NEXT:    [[LSR_FOLD_TERM_COND_REPLACED_TERM_COND:%.*]] = icmp eq ptr [[UGLYGEP2]], [[SCEVGEP]]
-; CHECK-NEXT:    br i1 [[LSR_FOLD_TERM_COND_REPLACED_TERM_COND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp sge i32 [[LSR_IV_NEXT]], [[N:%.*]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_END:%.*]], label [[FOR_BODY]]
 ; CHECK:       for.end:
 ; CHECK-NEXT:    ret void
 ;