[llvm] [LoopPeel] Peel to make Phis loop inductions (PR #121104)

Wed Dec 25 03:17:03 PST 2024

https://github.com/kasuga-fj created https://github.com/llvm/llvm-project/pull/121104

LoopPeel currently only handles Phis when they become loop invariants by peeling. There are also cases where peeling makes Phis loop inductions, and peeling in such cases is useful for other optimizations, such as loop vectorization. For example, consider the following loop.

```
int im = N-1;
for (int i=0;i<N;i++) {
  a[i] = b[i]+b[im];
  im = i;
}
```

In this case, peeling by 1 iteration makes `im` a loop induction, so we can vectorize the loop.
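
To make the effect concrete, here is a hand-written sketch of the loop before and after peeling one iteration. It is only an illustration, not the output of LoopPeel; the function names `kernelOriginal` and `kernelPeeled` and the use of `std::vector` are assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

// Original form: `im` trails `i` by one iteration, except on the first
// iteration, where it holds n - 1.
void kernelOriginal(std::vector<int> &a, const std::vector<int> &b) {
  assert(a.size() >= b.size() && "output must be at least as large as input");
  const int n = static_cast<int>(b.size());
  int im = n - 1;
  for (int i = 0; i < n; i++) {
    a[i] = b[i] + b[im];
    im = i;
  }
}

// Hand-peeled form: the first iteration runs separately. In the remaining
// loop `im` always equals `i - 1`, so it advances by a fixed step like `i`.
void kernelPeeled(std::vector<int> &a, const std::vector<int> &b) {
  assert(a.size() >= b.size() && "output must be at least as large as input");
  const int n = static_cast<int>(b.size());
  if (n == 0)
    return;
  a[0] = b[0] + b[n - 1]; // peeled first iteration (im == n - 1)
  for (int i = 1; i < n; i++)
    a[i] = b[i] + b[i - 1]; // im == i - 1
}
```

Both functions compute the same result; only the second has a loop body whose indices are all plain inductions.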
This patch allows vectorizing the kernels of s291 and s292 in TSVC. I measured on neoverse-v2 and observed a speedup of more than 60% (options: `-O3 -ffast-math -mcpu=neoverse-v2`).
Note that when I tried this with llvm-test-suite, some loops were peeled unnecessarily. The causes include peeling of a vectorization remainder loop and limitations of the SCEV-based analysis. However, as far as I have tested, these unnecessary peels do not affect performance.
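
For context on where a peel count comes from, the counting rule can be modelled outside LLVM. The sketch below is a toy under stated assumptions: `ToyPhiAnalyzer`, `Node`, `Kind`, `makeExample` and `iterationsToPeel` are hypothetical names, values are identified by strings, and induction Phis are flagged by hand instead of being detected through SCEV. The example graph mirrors the `_Z6binaryv_induction` test in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for the analysis; std::nullopt plays the role of "Unknown".
using PeelCounter = std::optional<unsigned>;

enum class Kind { Invariant, InductionPhi, Phi, Binary };

struct Node {
  Kind K = Kind::Invariant;
  std::string Op0; // back-edge input (Phi) or first operand (Binary)
  std::string Op1; // second operand (Binary)
};

struct ToyPhiAnalyzer {
  std::map<std::string, Node> Nodes;
  unsigned MaxIterations = 1;
  std::map<std::string, PeelCounter> Memo;

  // Unknown stays Unknown; exceeding the limit also yields Unknown.
  PeelCounter addOne(PeelCounter PC) const {
    if (!PC)
      return std::nullopt;
    return (*PC + 1 <= MaxIterations) ? PeelCounter{*PC + 1} : PeelCounter{};
  }

  PeelCounter calculate(const std::string &Name) {
    // Memoize, and seed with Unknown so that cycles terminate.
    auto [It, Inserted] = Memo.try_emplace(Name, std::nullopt);
    if (!Inserted)
      return It->second;
    auto NodeIt = Nodes.find(Name);
    if (NodeIt == Nodes.end())
      return std::nullopt; // anything not modelled is Unknown
    const Node &N = NodeIt->second;
    PeelCounter Result;
    switch (N.K) {
    case Kind::Invariant:
    case Kind::InductionPhi:
      Result = 0u; // starting points: nothing to peel
      break;
    case Kind::Phi:
      Result = addOne(calculate(N.Op0)); // one more than the back-edge input
      break;
    case Kind::Binary: {
      PeelCounter L = calculate(N.Op0);
      PeelCounter R = calculate(N.Op1);
      if (L && R)
        Result = std::max(*L, *R); // max of the operands
      break;
    }
    }
    assert((!Result || *Result <= MaxIterations) && "bad result in toy analysis");
    Memo[Name] = Result;
    return Result;
  }
};

// Largest known count over the given header Phis (0 if none is known).
unsigned iterationsToPeel(ToyPhiAnalyzer &A,
                          const std::vector<std::string> &HeaderPhis) {
  unsigned Iterations = 0;
  for (const auto &Phi : HeaderPhis)
    if (PeelCounter PC = A.calculate(Phi))
      Iterations = std::max(Iterations, *PC);
  return Iterations;
}

// i is an induction; add = i + 2; a = phi(add); y = phi(a); x = phi(y).
ToyPhiAnalyzer makeExample(unsigned MaxIterations) {
  ToyPhiAnalyzer A;
  A.MaxIterations = MaxIterations;
  A.Nodes["two"] = Node{Kind::Invariant, "", ""};
  A.Nodes["i"] = Node{Kind::InductionPhi, "", ""};
  A.Nodes["add"] = Node{Kind::Binary, "i", "two"};
  A.Nodes["a"] = Node{Kind::Phi, "add", ""};
  A.Nodes["y"] = Node{Kind::Phi, "a", ""};
  A.Nodes["x"] = Node{Kind::Phi, "y", ""};
  return A;
}
```

With a limit of 4, `iterationsToPeel` over the Phis `i`, `x`, `a`, `y` yields 3 in this model, the same peel count the test comment in the patch describes. With a limit of 2, `x` becomes Unknown and the result is capped at 2.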

This PR takes over #94900.
Resolves #81851

From ae2fcaf8449a2f56e15ef7cfd560f37a3a482e7a Mon Sep 17 00:00:00 2001
From: Ryotaro Kasuga <kasuga.ryotaro at fujitsu.com>
Date: Fri, 20 Dec 2024 09:13:36 +0000
Subject: [PATCH] [LoopPeel] Peel to make Phis loop inductions

LoopPeel currently only handles Phis when they become loop invariants by
peeling. There are also cases where peeling makes Phis loop inductions,
and peeling in such cases is useful for other optimizations, such as
loop vectorization. For example, consider the following loop.

```
int im = N-1;
for (int i=0;i<N;i++) {
  a[i] = b[i]+b[im];
  im = i;
}
```

In this case, peeling by 1 iteration makes `im` a loop induction, so we
can vectorize the loop.
This patch allows vectorizing the kernels of s291 and s292 in TSVC. I
measured on neoverse-v2 and observed a speedup of more than 60%
(options: `-O3 -ffast-math -mcpu=neoverse-v2`).
Note that when I tried this with llvm-test-suite, some loops were peeled
unnecessarily. The causes include peeling of a vectorization remainder
loop and limitations of the SCEV-based analysis. However, as far as I
have tested, these unnecessary peels do not affect performance.
---
 llvm/lib/Transforms/Utils/LoopPeel.cpp        | 133 +++++++++++++----
 .../LoopUnroll/peel-loop-phi-analysis.ll      | 139 ++++++++++++++++++
 2 files changed, 243 insertions(+), 29 deletions(-)

diff --git a/llvm/lib/Transforms/Utils/LoopPeel.cpp b/llvm/lib/Transforms/Utils/LoopPeel.cpp
index 3cbde39b30b4e4..f17046db3ce470 100644
--- a/llvm/lib/Transforms/Utils/LoopPeel.cpp
+++ b/llvm/lib/Transforms/Utils/LoopPeel.cpp
@@ -13,6 +13,7 @@
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
+#include "llvm/Analysis/IVDescriptors.h"
 #include "llvm/Analysis/Loads.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/LoopIterator.h"
@@ -151,6 +152,32 @@ namespace {
 // corresponding calls to g are determined and the code for computing
 // x, y, and a can be removed.
 //
+// Similarly, there are cases where peeling makes Phi nodes loop-inductions
+// (i.e., the value is increased or decreased by a fixed amount on every
+// iteration). For example, consider the following function.
+//
+//   #define N 100
+//   void f(int a[], int b[]) {
+//     int im = N - 1;
+//     for (int i = 0; i < N; i++) {
+//       a[i] = b[i] + b[im];
+//       im = i;
+//     }
+//   }
+//
+// The IR of the loop will look something like the following.
+//
+//   %i = phi i32 [ 0, %entry ], [ %i.next, %for.body ]
+//   %im = phi i32 [ 99, %entry ], [ %i, %for.body ]
+//   ...
+//   %i.next = add nuw nsw i32 %i, 1
+//   ...
+//
+// In this case, %im becomes a loop-induction variable by peeling 1 iteration,
+// because %i is a loop-induction variable. The peeling count can be determined
+// by the same algorithm as in the loop-invariant case. Such peeling is
+// profitable for loop vectorization.
+//
 // The PhiAnalyzer class calculates how many times a loop should be
 // peeled based on the above analysis of the phi nodes in the loop while
 // respecting the maximum specified.
@@ -160,7 +187,7 @@ class PhiAnalyzer {
 
   // Calculate the sufficient minimum number of iterations of the loop to peel
   // such that phi instructions become determined (subject to allowable limits)
-  std::optional<unsigned> calculateIterationsToPeel();
+  std::optional<unsigned> calculateIterationsToPeel(ScalarEvolution &SE);
 
 protected:
   using PeelCounter = std::optional<unsigned>;
@@ -175,13 +202,17 @@ class PhiAnalyzer {
 
   // Calculate the number of iterations after which the given value
   // becomes an invariant.
-  PeelCounter calculate(const Value &);
+  PeelCounter calculate(Value &, ScalarEvolution &SE);
+
+  // Returns true if the \p Phi is an induction in the target loop. This
+  // function is a wrapper of `InductionDescriptor::isInductionPHI`.
+  bool isInductionPHI(PHINode *Phi, ScalarEvolution &SE) const;
 
   const Loop &L;
   const unsigned MaxIterations;
 
-  // Map of Values to number of iterations to invariance
-  SmallDenseMap<const Value *, PeelCounter> IterationsToInvariance;
+  // Map of Values to number of iterations to invariance or induction
+  SmallDenseMap<const Value *, PeelCounter> IterationsToInvarianceOrInduction;
 };
 
 PhiAnalyzer::PhiAnalyzer(const Loop &L, unsigned MaxIterations)
@@ -190,6 +221,39 @@ PhiAnalyzer::PhiAnalyzer(const Loop &L, unsigned MaxIterations)
   assert(MaxIterations > 0 && "no peeling is allowed?");
 }
 
+bool PhiAnalyzer::isInductionPHI(PHINode *Phi, ScalarEvolution &SE) const {
+  if (!SE.isSCEVable(Phi->getType()))
+    return false;
+
+  const SCEV *Expr = SE.getSCEV(Phi);
+
+  // Ignore casts because they are noisy for peeling. For example, consider
+  // the following loop.
+  //
+  //   unsigned long N = ...;
+  //   for (unsigned int i = 0; i < N; i++) {
+  //     ...
+  //   }
+  //
+  // The IR of the loop becomes something like the following.
+  //
+  //   %i = phi i32 [ 0, %entry ], [ %i.next, %body ]
+  //   %conv = phi i64 [ 0, %entry ], [ %conv.next, %body ]
+  //   ...
+  //   %i.next = add i32 %i, 1
+  //   %conv.next = zext i32 %i.next to i64
+  //   ...
+  //
+  // The SCEV of %conv becomes something like (zext i32 {0,+,1}<nuw><%body> to
+  // i64), and this is not an induction. However, as for peeling, it is better
+  // to ignore such outermost casts to avoid unnecessary peeling.
+  while (const auto *Cast = dyn_cast<SCEVCastExpr>(Expr))
+    Expr = Cast->getOperand();
+
+  InductionDescriptor ID;
+  return InductionDescriptor::isInductionPHI(Phi, &L, &SE, ID, Expr);
+}
+
 // This function calculates the number of iterations after which the value
 // becomes an invariant. The pre-calculated values are memorized in a map.
 // N.B. This number will be Unknown or <= MaxIterations.
@@ -204,59 +268,70 @@ PhiAnalyzer::PhiAnalyzer(const Loop &L, unsigned MaxIterations)
 //           %y = phi(0, 5)
 //           %a = %y + 1
 //   G(%y) = Unknown otherwise (including phi not in header block)
-PhiAnalyzer::PeelCounter PhiAnalyzer::calculate(const Value &V) {
+PhiAnalyzer::PeelCounter PhiAnalyzer::calculate(Value &V, ScalarEvolution &SE) {
   // If we already know the answer, take it from the map.
   // Otherwise, place Unknown to map to avoid infinite recursion. Such
   // cycles can never stop on an invariant.
-  auto [I, Inserted] = IterationsToInvariance.try_emplace(&V, Unknown);
+  auto [I, Inserted] =
+      IterationsToInvarianceOrInduction.try_emplace(&V, Unknown);
   if (!Inserted)
     return I->second;
 
   if (L.isLoopInvariant(&V))
     // Loop invariant so known at start.
-    return (IterationsToInvariance[&V] = 0);
-  if (const PHINode *Phi = dyn_cast<PHINode>(&V)) {
+    return (IterationsToInvarianceOrInduction[&V] = 0);
+  if (PHINode *Phi = dyn_cast<PHINode>(&V)) {
     if (Phi->getParent() != L.getHeader()) {
       // Phi is not in header block so Unknown.
-      assert(IterationsToInvariance[&V] == Unknown && "unexpected value saved");
+      assert(IterationsToInvarianceOrInduction[&V] == Unknown &&
+             "unexpected value saved");
       return Unknown;
     }
+
+    // If Phi is an induction, register it as a starting point.
+    if (isInductionPHI(Phi, SE))
+      return (IterationsToInvarianceOrInduction[&V] = 0);
+
     // We need to analyze the input from the back edge and add 1.
     Value *Input = Phi->getIncomingValueForBlock(L.getLoopLatch());
-    PeelCounter Iterations = calculate(*Input);
-    assert(IterationsToInvariance[Input] == Iterations &&
+    PeelCounter Iterations = calculate(*Input, SE);
+    assert(IterationsToInvarianceOrInduction[Input] == Iterations &&
            "unexpected value saved");
-    return (IterationsToInvariance[Phi] = addOne(Iterations));
+    return (IterationsToInvarianceOrInduction[Phi] = addOne(Iterations));
   }
   if (const Instruction *I = dyn_cast<Instruction>(&V)) {
     if (isa<CmpInst>(I) || I->isBinaryOp()) {
       // Binary instructions get the max of the operands.
-      PeelCounter LHS = calculate(*I->getOperand(0));
+      PeelCounter LHS = calculate(*I->getOperand(0), SE);
       if (LHS == Unknown)
         return Unknown;
-      PeelCounter RHS = calculate(*I->getOperand(1));
+      PeelCounter RHS = calculate(*I->getOperand(1), SE);
       if (RHS == Unknown)
         return Unknown;
-      return (IterationsToInvariance[I] = {std::max(*LHS, *RHS)});
+      return (IterationsToInvarianceOrInduction[I] = {std::max(*LHS, *RHS)});
     }
     if (I->isCast())
       // Cast instructions get the value of the operand.
-      return (IterationsToInvariance[I] = calculate(*I->getOperand(0)));
+      return (IterationsToInvarianceOrInduction[I] =
+                  calculate(*I->getOperand(0), SE));
   }
   // TODO: handle more expressions
 
   // Everything else is Unknown.
-  assert(IterationsToInvariance[&V] == Unknown && "unexpected value saved");
+  assert(IterationsToInvarianceOrInduction[&V] == Unknown &&
+         "unexpected value saved");
   return Unknown;
 }
 
-std::optional<unsigned> PhiAnalyzer::calculateIterationsToPeel() {
+std::optional<unsigned>
+PhiAnalyzer::calculateIterationsToPeel(ScalarEvolution &SE) {
   unsigned Iterations = 0;
   for (auto &PHI : L.getHeader()->phis()) {
-    PeelCounter ToInvariance = calculate(PHI);
-    if (ToInvariance != Unknown) {
-      assert(*ToInvariance <= MaxIterations && "bad result in phi analysis");
-      Iterations = std::max(Iterations, *ToInvariance);
+    PeelCounter ToInvarianceOrInduction = calculate(PHI, SE);
+    if (ToInvarianceOrInduction != Unknown) {
+      assert(*ToInvarianceOrInduction <= MaxIterations &&
+             "bad result in phi analysis");
+      Iterations = std::max(Iterations, *ToInvarianceOrInduction);
       if (Iterations == MaxIterations)
         break;
     }
@@ -585,14 +660,14 @@ void llvm::computePeelCount(Loop *L, unsigned LoopSize,
   // in TTI.getPeelingPreferences or by the flag -unroll-peel-count.
   unsigned DesiredPeelCount = TargetPeelCount;
 
-  // Here we try to get rid of Phis which become invariants after 1, 2, ..., N
-  // iterations of the loop. For this we compute the number for iterations after
-  // which every Phi is guaranteed to become an invariant, and try to peel the
-  // maximum number of iterations among these values, thus turning all those
-  // Phis into invariants.
+  // Here we try to get rid of Phis which become invariants or inductions after
+  // 1, 2, ..., N iterations of the loop. For this we compute the number for
+  // iterations after which every Phi is guaranteed to become an invariant or an
+  // induction, and try to peel the maximum number of iterations among these
+  // values, thus turning all those Phis into invariants or inductions.
   if (MaxPeelCount > DesiredPeelCount) {
     // Check how many iterations are useful for resolving Phis
-    auto NumPeels = PhiAnalyzer(*L, MaxPeelCount).calculateIterationsToPeel();
+    auto NumPeels = PhiAnalyzer(*L, MaxPeelCount).calculateIterationsToPeel(SE);
     if (NumPeels)
       DesiredPeelCount = std::max(DesiredPeelCount, *NumPeels);
   }
@@ -610,7 +685,7 @@ void llvm::computePeelCount(Loop *L, unsigned LoopSize,
     if (DesiredPeelCount + AlreadyPeeled <= UnrollPeelMaxCount) {
       LLVM_DEBUG(dbgs() << "Peel " << DesiredPeelCount
                         << " iteration(s) to turn"
-                        << " some Phis into invariants.\n");
+                        << " some Phis into invariants or inductions.\n");
       PP.PeelCount = DesiredPeelCount;
       PP.PeelProfiledIterations = false;
       return;
diff --git a/llvm/test/Transforms/LoopUnroll/peel-loop-phi-analysis.ll b/llvm/test/Transforms/LoopUnroll/peel-loop-phi-analysis.ll
index e24eeef52de4e9..0362892dcb1cdb 100644
--- a/llvm/test/Transforms/LoopUnroll/peel-loop-phi-analysis.ll
+++ b/llvm/test/Transforms/LoopUnroll/peel-loop-phi-analysis.ll
@@ -197,3 +197,142 @@ for.body:
   %exitcond = icmp eq i32 %inc, 100000
   br i1 %exitcond, label %for.cond.cleanup, label %for.body
 }
+
+; Check that phi analysis can handle a binary operator with induction variable.
+define void @_Z6binaryv_induction() {
+; The phis become induction through the chain of phis, with a unary
+; instruction on a loop induction.  Check that the phis for x, a, and y become
+; a loop induction since x is based on y, which is based on a, which is based
+; on a binary add of a constant and i, which is a loop induction.
+; Consider the calls to g:
+; First iteration: g(0), x=0, g(0), y=1, a=2
+; Second iteration: g(0), x=1, g(2), y=3, a=3
+; Third iteration: g(1), x=3, g(3), y=4, a=4
+; Fourth iteration (and subsequent): g(i), x=i+1, g(i+1), y=i+2, a=i+2
+; Therefore, peeling 3 times removes the phi nodes.
+;
+; void g(int);
+; void binary() {
+;   int x = 0;
+;   int y = 0;
+;   int a = 0;
+;   for(int i = 0; i <100000; ++i) {
+;     g(x);
+;     x = y;
+;     g(a);
+;     y = a + 1;
+;     a = i + 2;
+;   }
+; }
+; CHECK-LABEL: @_Z6binaryv_induction(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[FOR_BODY_PEEL_BEGIN:%.*]]
+; CHECK:       for.body.peel.begin:
+; CHECK-NEXT:    br label [[FOR_BODY_PEEL:%.*]]
+; CHECK:       for.body.peel:
+; CHECK-NEXT:    tail call void @_Z1gi(i32 signext 0)
+; CHECK-NEXT:    tail call void @_Z1gi(i32 signext 0)
+; CHECK-NEXT:    [[ADD_PEEL:%.*]] = add nuw nsw i32 0, 2
+; CHECK-NEXT:    [[INC_PEEL:%.*]] = add nuw nsw i32 0, 1
+; CHECK-NEXT:    [[EXITCOND_PEEL:%.*]] = icmp ne i32 [[INC_PEEL]], 100000
+; CHECK-NEXT:    br i1 [[EXITCOND_PEEL]], label [[FOR_BODY_PEEL_NEXT:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK:       for.body.peel.next:
+; CHECK-NEXT:    br label [[FOR_BODY_PEEL2:%.*]]
+; CHECK:       for.body.peel2:
+; CHECK-NEXT:    tail call void @_Z1gi(i32 signext 0)
+; CHECK-NEXT:    tail call void @_Z1gi(i32 signext [[ADD_PEEL]])
+; CHECK-NEXT:    [[ADD_PEEL3:%.*]] = add nuw nsw i32 [[INC_PEEL]], 2
+; CHECK-NEXT:    [[INC_PEEL4:%.*]] = add nuw nsw i32 [[INC_PEEL]], 1
+; CHECK-NEXT:    [[EXITCOND_PEEL5:%.*]] = icmp ne i32 [[INC_PEEL4]], 100000
+; CHECK-NEXT:    br i1 [[EXITCOND_PEEL5]], label [[FOR_BODY_PEEL_NEXT1:%.*]], label [[FOR_COND_CLEANUP]]
+; CHECK:       for.body.peel.next1:
+; CHECK-NEXT:    br label [[FOR_BODY_PEEL7:%.*]]
+; CHECK:       for.body.peel7:
+; CHECK-NEXT:    tail call void @_Z1gi(i32 signext 0)
+; CHECK-NEXT:    tail call void @_Z1gi(i32 signext [[ADD_PEEL3]])
+; CHECK-NEXT:    [[ADD_PEEL8:%.*]] = add nuw nsw i32 [[INC_PEEL4]], 2
+; CHECK-NEXT:    [[INC_PEEL9:%.*]] = add nuw nsw i32 [[INC_PEEL4]], 1
+; CHECK-NEXT:    [[EXITCOND_PEEL10:%.*]] = icmp ne i32 [[INC_PEEL9]], 100000
+; CHECK-NEXT:    br i1 [[EXITCOND_PEEL10]], label [[FOR_BODY_PEEL_NEXT6:%.*]], label [[FOR_COND_CLEANUP]]
+; CHECK:       for.body.peel.next6:
+; CHECK-NEXT:    br label [[FOR_BODY_PEEL_NEXT11:%.*]]
+; CHECK:       for.body.peel.next11:
+; CHECK-NEXT:    br label [[ENTRY_PEEL_NEWPH:%.*]]
+; CHECK:       entry.peel.newph:
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.cond.cleanup.loopexit:
+; CHECK-NEXT:    br label [[FOR_COND_CLEANUP]]
+; CHECK:       for.cond.cleanup:
+; CHECK-NEXT:    ret void
+; CHECK:       for.body:
+; CHECK-NEXT:    [[I:%.*]] = phi i32 [ [[INC_PEEL9]], [[ENTRY_PEEL_NEWPH]] ], [ [[INC:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[X:%.*]] = phi i32 [ [[ADD_PEEL]], [[ENTRY_PEEL_NEWPH]] ], [ [[Y:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[A:%.*]] = phi i32 [ [[ADD_PEEL8]], [[ENTRY_PEEL_NEWPH]] ], [ [[ADD:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[Y]] = phi i32 [ [[ADD_PEEL3]], [[ENTRY_PEEL_NEWPH]] ], [ [[A]], [[FOR_BODY]] ]
+; CHECK-NEXT:    tail call void @_Z1gi(i32 signext [[X]])
+; CHECK-NEXT:    tail call void @_Z1gi(i32 signext [[A]])
+; CHECK-NEXT:    [[ADD]] = add nuw nsw i32 [[I]], 2
+; CHECK-NEXT:    [[INC]] = add nuw nsw i32 [[I]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp ne i32 [[INC]], 100000
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_BODY]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], !llvm.loop [[LOOP3:![0-9]+]]
+;
+entry:
+  br label %for.body
+
+for.cond.cleanup:
+  ret void
+
+for.body:
+  %i = phi i32 [ 0, %entry ], [ %inc, %for.body ]
+  %x = phi i32 [ 0, %entry ], [ %y, %for.body ]
+  %a = phi i32 [ 0, %entry ], [ %add, %for.body ]
+  %y = phi i32 [ 0, %entry ], [ %a, %for.body ]
+  tail call void @_Z1gi(i32 signext %x)
+  tail call void @_Z1gi(i32 signext %a)
+  %add = add nuw nsw i32 %i, 2
+  %inc = add nuw nsw i32 %i, 1
+  %exitcond = icmp ne i32 %inc, 100000
+  br i1 %exitcond, label %for.body, label %for.cond.cleanup
+}
+
+; Check that phi analysis can handle cast operations with induction variable.
+define void @_Z6induction_with_cast(ptr noundef %a, i64 noundef %size) {
+; The original code looks like the following. We don't need to peel the loop
+; to make the phis loop inductions.
+;
+; void f(unsigned int *a, unsigned long N) {
+;   for (unsigned int i=0; i<N; i++)
+;     a[i] = 10;
+; }
+;
+; CHECK-LABEL: @_Z6induction_with_cast(
+; CHECK-NEXT:  for.body.preheader:
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[CONV6:%.*]] = phi i64 [ [[CONV:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER:%.*]] ]
+; CHECK-NEXT:    [[I_05:%.*]] = phi i32 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[A:%.*]], i64 [[CONV6]]
+; CHECK-NEXT:    store i32 10, ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT:    [[ADD]] = add i32 [[I_05]], 1
+; CHECK-NEXT:    [[CONV]] = zext i32 [[ADD]] to i64
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[SIZE:%.*]], [[CONV]]
+; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_COND_CLEANUP:%.*]]
+; CHECK:       for.cond.cleanup:
+; CHECK-NEXT:    ret void
+;
+for.body.preheader:
+  br label %for.body
+
+for.body:
+  %conv6 = phi i64 [ %conv, %for.body ], [ 0, %for.body.preheader ]
+  %i.05 = phi i32 [ %add, %for.body ], [ 0, %for.body.preheader ]
+  %arrayidx = getelementptr inbounds nuw i32, ptr %a, i64 %conv6
+  store i32 10, ptr %arrayidx, align 4
+  %add = add i32 %i.05, 1
+  %conv = zext i32 %add to i64
+  %cmp = icmp ugt i64 %size, %conv
+  br i1 %cmp, label %for.body, label %for.cond.cleanup
+
+for.cond.cleanup:
+  ret void
+}