[PATCH] [Patch] Loop Interchange Pass

Thu Feb 26 21:11:52 PST 2015

Hi Hal,
Please find my comments inline. Updated the patch as per review comments and fixed few issues found during llvm lnt regression. 
The current version of loop interchange gives some 30% improvement in execution time in 2 benchmarks. This is because it contains code fragments like -

  for (i = 0; i < _PB_N; i++)
   for (j = 0; j < _PB_N; j++)
     x[i] = x[i] + beta * A[j][i] * y[j];

which gets benefited after interchange.
There are few compile time regression which can be because of the heavy legality checks in loop interchange. I will try to fix this in next iteration. 
Apart from this we found a crash in Dependency Analysis module which I'm planning to fix seperatly as i need to understand it in more detail. Will raise a bug on the same.

F407640: LNT_result.png <http://reviews.llvm.org/F407640>

Thanks and Regards
Karthik Bhat


================
Comment at: lib/Transforms/IPO/PassManagerBuilder.cpp:262
@@ -261,3 +261,3 @@
   addExtensionsToPM(EP_ScalarOptimizerLate, MPM);
-
+  MPM.add(createLoopInterchangePass());
   if (RerollLoops)
----------------
hfinkel wrote:
> I'd certainly like to have this on by default eventually, but we should be more conservative at first. Please add a command-line flag to enable this (there are several in this file already), so we can do further testing.
Sure. I had added it for testing performance forgot to revert before checkin. Will add a command line flag and disable it by default.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:11
@@ +10,3 @@
+// This Pass handles loop interchange transform. This interchanges the inner
+// and outer loop if interchanging can result in better cache hits or fine
+// grained parallelism.
----------------
hfinkel wrote:
> You should use the 'cache-friendly memory access pattern' terminology here too.
Done.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:50
@@ +49,3 @@
+typedef SmallVector<Loop *, 8> LoopVector;
+typedef std::vector<std::vector<char>> CharMatrix;
+class LoopInterchange;
----------------
hfinkel wrote:
> Is this matrix generally sparse? (or could we make it sparse by picking some default). If so, is this the right data structure?
Yes this matrix can be sparse depending on the dependence carried by the loop. I will check more on this front. Have added a TODO for now.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:74
@@ +73,3 @@
+
+bool populateDependencyMatrix(CharMatrix &DepMatrix, unsigned Level, Loop *L,
+                              DependenceAnalysis *DA) {
----------------
hfinkel wrote:
> I'm somewhat worried about doing this eagerly for all loops; what if they're really large with lots of memory accesses? Maybe we should have a cutoff?
> 
Makes sense. Added a limit of max 10 loops(Columns in the dependency matrix) and 100 dependencies(Rows of the dependency matrix).

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:652
@@ +651,3 @@
+    PhiCount++;
+  }
+  return PhiCount;
----------------
hfinkel wrote:
> No need for { } here.
Done.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:855
@@ +854,3 @@
+  // TODO: Improve this heuristic to catch more cases.
+  // If the inner loop is loop independent or doesn't carry any dependency it is
+  // profitable to move this to outer position.
----------------
hfinkel wrote:
> I don't really understand this comment. I think we can assume that LICM has run first. (and if this pass detects loop-invariant code better than LICM, that is another problem to fix, but not here).
Hi Hal,
Consider the below code of matrix multiplication-
  for(int i=0;i<N;i++)
    for(int j=0;j<N;j++)
      for(int k=0;k<N;k++)
        A[i][j]= A[i][j]+B[i][k]*C[k][j]
In this example the direction vector would be - 
[= = |<] (i.e. '=' dependency in i, '=' dependency in j and is loop independent dependency in k). The LICM pass would move getElementPointer for A[i][j] outside the inner loop but it cannot move the complete statement outside the inner loop.
Now since vectorizer only works on inner loop. The above code is not vectorized for i,j. But if we interchange the loops to -
  for(int k=0;k<N;k++)
    for(int i=0;i<N;i++)
      for(int j=0;j<N;j++)
        A[i][j]= A[i][j]+B[i][k]*C[k][j]
now the loop gets vectorized. 
It is mostly profitable to keep loop independent dependencies such as the above at the outermost possible level. We try to achieve the same here.

================
Comment at: lib/Transforms/Scalar/LoopInterchange.cpp:916
@@ +915,3 @@
+  for (Loop::iterator I = InnerLoop->begin(), E = InnerLoop->end(); I != E;
+       ++I) {
+    OuterLoop->addChildLoop(InnerLoop->removeChildLoop(I));
----------------
hfinkel wrote:
> No need for the { }
Done.

http://reviews.llvm.org/D7499

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/