[PATCH] [Patch] Loop Interchange Pass

Thu Feb 12 07:17:41 PST 2015

Hi Hal,
Thanks for the review. Please find my comments updated below. Will upload modified patch shortly.

P.S. Sorry for the long comments,
Thanks and Regards
Karthik Bhat


REPOSITORY
  rL LLVM

================
Comment at: include/llvm/Transforms/Scalar.h:135
@@ -134,1 +134,3 @@
 //
+// LoopInterchange - This pass is interchanges loops to give better cache hits.
+//
----------------
hfinkel wrote:
> How about, "This pass interchanges loops to provide a more cache-friendly memory access patterns."
Yes of course your comment makes more sense.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:49
@@ +48,3 @@
+
+unsigned getInnerLoopCount(Loop &L, unsigned level) {
+  int lev = 0;
----------------
hfinkel wrote:
> Please add a comment explaining what this function computes?
This function get the maximum nesting level of the innermost loop. We use this to push loops of depth 2 to worklist.
For e.g.
  for(int i=0;i<N;i++)
    for(int j=0j<M;j++)
      for(int k=0;k<K,k++)
here we want to return 3 as the max nesting level is 3. I have renamed the function and added comment also modified this function a bit to correctly return the max loop depth in case we have multiple inner loops. For e.g.
  for(int i=0;i<N;i++) {

      for(int j=0j<M;j++) {
         for(int k=0;k<K;k++) {
           // this loop has depth 3
        }
      }

     for(intk=0;k<K;k++) {
     // this loop has depth 2
     }
  }

In the above case we still return 3 as it is the max depth.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:61
@@ +60,3 @@
+  // Also handle loop depth of 2 more appropriatly.
+  if (getInnerLoopCount(L, 0) == 2) {
+    Loop *Inner;
----------------
hfinkel wrote:
> LoopInfo already has a getLoopDepth() function. Can you use that?
getLoopDepth currently only returns the nesting level of the current loop. Since we have access to outer loop here we always get nesting level as 1. So had to go with the recursive function above.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:75
@@ +74,3 @@
+    PHINode *phiVar = dyn_cast<PHINode>(I);
+    if (phiVar->getNumIncomingValues() != 2)
+      continue;
----------------
hfinkel wrote:
> This is not actually what you want. If the loop is branched to by, for example, multiple entries of the switch statement, the predecessor can be listed multiple times in the predecessor list (and, thus, you'll have more than two incoming values even though you have only 2 predecessor blocks). 
> 
> I suspect that what you actually want is that there is a unique latch and a unique predecessor, so you want that L->getLoopLatch() && L->getLoopPredecessor() [neither are nullptr].
Updated the code. Thanks for clarifying the problem.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:81
@@ +80,3 @@
+      continue;
+    const SCEV *Step = AddRec->getStepRecurrence(*SE);
+    const SCEVConstant *C = dyn_cast<SCEVConstant>(Step);
----------------
hfinkel wrote:
> Do you also need to check that AddRec->isAffine()?
Yes i fell we need to check isAffine as well. Thanks updated the code.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:86
@@ +85,3 @@
+    // Found the induction variable.
+    // TODO: What if we have 2 induction variable? Currently legality makes sure
+    // we have only 1 induction variable
----------------
hfinkel wrote:
> Let's say:
> 
>   // FIXME: Handle loops with more than one induction variable. Note that, currently, legality makes sure we have only one induction variable.
OK. Done.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:212
@@ +211,3 @@
+    // Build up a worklist of loop pairs to analyze.
+    // [TODO] Currently only supports loop with level 2.
+    // Handle for loops greater than level 2.
----------------
hfinkel wrote:
> I'd move this FIXME comment somewhere else, it is not particularly useful here. It is more useful to tag places that assume only two levels.
OK..

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:261
@@ +260,3 @@
+  BranchInst *outerLoopHeaderBI =
+      dyn_cast<BranchInst>(outerLoopHeader->getTerminator());
+  unsigned num = outerLoopHeaderBI->getNumSuccessors();
----------------
hfinkel wrote:
> Either you should handle the case where the dyn_cast fails, or if it can't fail (because we've already verified that this must be a BranchInst), then use cast<> instead.
> 
> This same comment applies to many places below as well. Only use dyn_cast if the cast can fail (in which case you should handle the nullptr case). Otherwise, use cast<>.
Updated code to use cast<> wherever possible. Added null checks in places were dyn_cast is being used.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:273
@@ +272,3 @@
+  // The idea here is that we only need to check for usage of induction
+  // variables in header/latch to find extra instructions in the loop.Any loop
+  // independent instructions would anyhow have been sinked/hoised out of the
----------------
hfinkel wrote:
> Space before 'Any'
Modified comment.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:275
@@ +274,3 @@
+  // independent instructions would anyhow have been sinked/hoised out of the
+  // loop by licm.
+  PHINode *indVar = getInductionVariable(outerLoop, SE);
----------------
hfinkel wrote:
> licm -> LICM
Modified comment.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:297
@@ +296,3 @@
+       ++I) {
+    if (isa<CallInst>(I))
+      return false;
----------------
hfinkel wrote:
> What are you actually trying to check here? Instructions with side effects? Maybe you want I->mayHaveSideEffects()?
Hi Hal,
The way i'm trying to conclude that a loop is tightly nested is as follows-
1) There should not be any extra block between the outer loop and inner loop.
 (i.e. in this case the outer loop header would branch to inner loop preheader/inner loop body && the other branch in the header would go to the outer loop latch).

With this check we can catch loops which have a block inbetween outer and inner loop such as -

  for(int i=0;i<N;i++) {
    if(X) { }
    for(int j=0;j<N;j++) {
   }
  }
and conclude these as not tightly nested.

2) Second type of non nested loops can be-

  for(int i=0;i<N;i++) {
    a = i;
    k = A[i];
    for(int j=0;j<N;j++) {
    }
  }
these kind of loops will be caught by the second check which check we have a single use of indvar in latch or header which is the operand to Induction Phi(i.e used to increment/decrement the loop counter). I have modified this function a bit in the updated patch.

In the 3rd case i was trying to catch loops such as -
  for(int i=0;i<N;i++) {
    foo();
    for(int j=0;j<N;j++) {
    }
  }
I think we can do it using a combination of mayHaveSideEffects and mayReadFromMemory. Updated the patch.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:309
@@ +308,3 @@
+       ++I) {
+    if (isa<CallInst>(I))
+      return false;
----------------
hfinkel wrote:
> These checks look identical to those above, please make a function (a lambda function is fine).
Done.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:363
@@ +362,3 @@
+          // TODO: Flow dependency can be interchanged??
+          DEBUG(dbgs() << "Flow dependence not handled");
+          return false;
----------------
hfinkel wrote:
> Can you include *Src and *Des in these debug messages so that we can see what instructions are relevant?
Done.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:415
@@ +414,3 @@
+  // We currently handle only 1 induction variable inside the loop. We also do
+  // not handle reductions as of now.
+  for (auto I = innerLoopHeader->begin(), E = innerLoopHeader->end(); I != E;
----------------
hfinkel wrote:
> How are you checking for reductions here? Do you need to check that the one PHI you've found is not used outside of the loop?
We are currently checking if there is only 1 PHI node in the lop header which will corrospond to the induction variable. If we find any other PHI's either due to reductions or triangular loop structure. We currently exit as  current limiation.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:446
@@ +445,3 @@
+    BasicBlock *BB = II->getParent();
+    if (BB == innerLoopLatch)
+      userCount++;
----------------
hfinkel wrote:
> Why are you only counting uses in the latch block? Should the increment be in some other block, then what?
We count the uses in loop latch only as we split the latch based on this instruction. This was done because the mentioned example generated code as -
  for.body3:                                        ; preds = %for.body3, %for.body3.lr.ph
  %j.018 = phi i32 [ 0, %for.body3.lr.ph ], [ %add6, %for.body3 ]
  %arrayidx4 = getelementptr inbounds [100 x [100 x i32]]* @A, i32 0, i32 %j.018, i32 %i.020
  %5 = load i32* %arrayidx4, align 4, !tbaa !1
  %add = add nsw i32 %3, %5
  %add6 = add nuw nsw i32 %j.018, 1
  %arrayidx8 = getelementptr inbounds [100 x [100 x i32]]* @A, i32 0, i32 %add6, i32 %add5
  store i32 %add, i32* %arrayidx8, align 4, !tbaa !1
  %exitcond = icmp eq i32 %j.018, %4
  br i1 %exitcond, label %for.inc9.loopexit, label %for.body3
since we cannot split at 
  %add6 = add nuw nsw i32 %j.018, 1
we give up in this case. But now that i think about it counting uses may not be the right method to check if we can split the inner loop latch. Consider the following valid loop were we fail with this check-
    for(int i=0;i<100;i++)
    for(int j=0;j<100;j++)
       A[j][i] = A[j][i]+k;
here we get the inner loop latch as -
  for.body3:                                        ; preds = %for.body3, %for.cond1.preheader
  %j.015 = phi i32 [ 0, %for.cond1.preheader ], [ %inc, %for.body3 ]
  %arrayidx4 = getelementptr inbounds [100 x [100 x i32]]* @A, i32 0, i32 %j.015, i32 %i.016
  %1 = load i32* %arrayidx4, align 4, !tbaa !1
  %add = add nsw i32 %0, %1
  store i32 %add, i32* %arrayidx4, align 4, !tbaa !1
  %inc = add nuw nsw i32 %j.015, 1
  %exitcond = icmp eq i32 %inc, 100
  br i1 %exitcond, label %for.inc7, label %for.body3
This could have been splitted at 
  %inc = add nuw nsw i32 %j.015, 1
but we fail as we find more than 1 uses.
Modified the logic to check tightly grouped inner loop latch which can be splitted.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:459
@@ +458,3 @@
+  if (!OuterLoop->getLoopPreheader() || !InnerLoop->getLoopPreheader()) {
+    DEBUG(dbgs() << "loop control flow is not understood");
+    return false;
----------------
hfinkel wrote:
> Let's say, "Inner or outer loops lack a preheader"?
> 
> Also, for the future, adding a preheader when one is not present is pretty easy (you just need to call InsertPreheaderForLoop from llvm/Transforms/Utils/LoopUtils.h), we this is a limitation that should be removed sooner rather than later (although after the initial commit is okay).
Updated code to add a preheader when not present.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:500
@@ +499,3 @@
+    Instruction *UseInstr = cast<Instruction>(*IB);
+    if (isa<GetElementPtrInst>(UseInstr)) {
+      GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(UseInstr);
----------------
hfinkel wrote:
>   if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(UseInstr)) {
>     ...
>   }
Done.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:502
@@ +501,3 @@
+      GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(UseInstr);
+      if (GEP->getOperand(GEP->getNumOperands() - 1) == IV)
+        goodOrder += 1;
----------------
hfinkel wrote:
> What happens if it is not the IV directly, but some expression of the IV? I think you'd be better off using ScalarEvolution here, get the AddRec of the GEP, and see if the "outer" AddRec is provided in terms of the SCEV of the IV (or something like that).
Updated code to use SCEV to get the loop from which we get the operand to decide it is a good or bad load.
Able to handle code like-
  for(i=0;i<N;i+=1)
    for(j=0;j<N;j++)
        A[j-1][i-1] = A[j-1][i-1]+C[j-1][i-1];
after change. This now gets vectorized after interchange.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:550
@@ +549,3 @@
+  // TODO: This splitting logic may not work always. Fix this.
+  splitInnerLoopLatch(innerIndexVar);
+  DEBUG(dbgs() << "splitInnerLoopLatch Done\n");
----------------
hfinkel wrote:
> Use SplitBlock from include/llvm/Transforms/Utils/BasicBlockUtils.h?
> 
> (same for other functions below)?
Updated code.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:725
@@ +724,3 @@
+      !innerLoopHeaderSuccBI)
+    assert(0 && "This should not be triggered.We have already modified parts "
+                "of the loop");
----------------
hfinkel wrote:
> Use llvm_unreachable, not assert(0 &&
Done.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:817
@@ +816,3 @@
+       I != E; ++I) {
+    if (!isa<PHINode>(I))
+      continue;
----------------
hfinkel wrote:
> PHIs are always at the beginning of the block; once you hit the first non-PHI, you can exit the loop (you should never find another).
Yes you are right. Modified code.

http://reviews.llvm.org/D7499

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/