[PATCH] Reimplement heuristic for estimating complete-unroll optimization effects.

Michael Zolotukhin mzolotukhin at apple.com
Mon Feb 23 19:39:48 PST 2015


Hi Chandler,

Thanks for the review, please find my replies inline.


================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:488-492
@@ +487,7 @@
+    EarlyExitFound = false;
+    // Estimate number of instructions, that could be simplified if we replace a
+    // load with the corresponding constant. Since the same load will take
+    // different values on different iterations, we have to go through all
+    // loop's
+    // iterations here.
+    for (Iteration = 0; Iteration < TripCount; ++Iteration) {
----------------
chandlerc wrote:
> This comment seems out of date.
It was just weirdly formatted :)

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:494-495
@@ +493,4 @@
+    for (Iteration = 0; Iteration < TripCount; ++Iteration) {
+      if (EarlyExitFound)
+        break;
+
----------------
chandlerc wrote:
> There is essentially no documentation of "EarlyExitFound" or why this behavior is correct.
> 
> Generally, I'm worried about all the 'break' and loop control flow here. I think this is going to be a bit easier to understand if our early-stop points just return.
I will add more comments regarding the EarlyExitFound flag.

The logic behind it is the following: when at some point we manage to resolve a conditional branch in some block, and that branch leads out of the loop, we don't want to analyze further iterations: if we unroll the loop, they will become dead code and be removed. However, we do want to finish estimating the current iteration, which is why we don't just return (otherwise we would miss the cost of the blocks remaining in the worklist).
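
To make that concrete, here is a minimal standalone sketch of that control flow (SimBlock, simulateLoop, and the cost model are simplified stand-ins of my own, not the actual patch):

    #include <unordered_set>
    #include <vector>

    // Simplified stand-in for a loop body block: in the real pass these are
    // BasicBlocks, TTI cost queries, and folded conditional branches.
    struct SimBlock {
      unsigned Cost = 0;
      bool BranchFoldsToLoopExit = false; // Branch resolved to exit the loop.
      std::vector<SimBlock *> Succs;      // Successors inside the loop.
    };

    // Once an iteration is known to leave the loop, later iterations become
    // dead code after unrolling, so they are skipped. But blocks already
    // queued for the *current* iteration must still be costed, which is why
    // we set a flag and finish draining the worklist instead of returning.
    unsigned simulateLoop(SimBlock *Header, unsigned TripCount,
                          unsigned MaxSize) {
      unsigned UnrolledLoopSize = 0;
      bool EarlyExitFound = false;

      for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {
        if (EarlyExitFound)
          break; // Remaining iterations are unreachable after unrolling.

        std::vector<SimBlock *> BBWorklist{Header};
        std::unordered_set<SimBlock *> Visited{Header}; // SetVector in the patch.
        while (!BBWorklist.empty()) {
          SimBlock *BB = BBWorklist.back();
          BBWorklist.pop_back();

          UnrolledLoopSize += BB->Cost;
          if (UnrolledLoopSize > MaxSize)
            return UnrolledLoopSize; // A hard size limit *can* just return.

          if (BB->BranchFoldsToLoopExit) {
            EarlyExitFound = true; // Don't return: finish this iteration first.
            continue;              // The in-loop successor is never taken.
          }
          for (SimBlock *Succ : BB->Succs)
            if (Visited.insert(Succ).second)
              BBWorklist.push_back(Succ);
        }
      }
      return UnrolledLoopSize;
    }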

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:497
@@ +496,3 @@
+
+      BBSetVector BBWorklist;
+      BBWorklist.insert(L->getHeader());
----------------
chandlerc wrote:
> Rather than declaring this in the iteration loop, please declare it in simulateLoop and clear it on each iteration so that we re-use allocated storage. We're *very* likely to put roughly the same number of BBs on the worklist each trip through.
Good point, thanks. Fixed.
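
For reference, the hoisted worklist ends up looking roughly like this (a fragment assuming the pass's existing BBSetVector typedef, L, and TripCount; illustrative, not the exact patch):

    // Declared once in simulateLoop() and cleared per iteration, so the
    // SetVector's allocated storage is reused instead of reallocated each trip.
    BBSetVector BBWorklist;
    for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {
      BBWorklist.clear();
      BBWorklist.insert(L->getHeader());
      // ... drain BBWorklist for this iteration ...
    }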

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:505-512
@@ +504,10 @@
+
+        analyzeBlock(BB);
+
+        // If unrolled body turns out to be too big, bail out.
+        if (UnrolledLoopSize - NumberOfOptimizedInstructions >
+            MaxUnrolledLoopSize)
+          return;
+
+        addBBSuccessors(BB, &BBWorklist);
+      }
----------------
chandlerc wrote:
> I think it would be more clear to inline the analyzeBlock and addBBSuccessors here... At the least, inlining addBBSuccessors avoids passing around the worklist.
Fixed, thanks.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:515-519
@@ -469,10 +514,7 @@
 
-    // We start by adding all loads to the worklist.
-    for (auto &LoadDescr : LoadBaseAddresses) {
-      LoadInst *LI = LoadDescr.first;
-      SimplifiedValues[LI] = computeLoadValue(LI, Iteration);
-      if (CountedInstructions.insert(LI).second)
-        NumberOfOptimizedInstructions += TTI.getUserCost(LI);
-
-      for (User *U : LI->users())
-        Worklist.insert(cast<Instruction>(U));
+      // If we found no optimization opportunities on the first iteration, we
+      // won't find them on later ones too.
+      if (!NumberOfOptimizedInstructions) {
+        UnrolledLoopSize = UINT_MAX;
+        break;
+      }
----------------
chandlerc wrote:
> Why not return?
Fixed, thanks.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:524-531
@@ -480,13 +523,10 @@
 
-    // And then we try to simplify every user of every instruction from the
-    // worklist. If we do simplify a user, add it to the worklist to process
-    // its users as well.
-    while (!Worklist.empty()) {
-      Instruction *I = Worklist.pop_back_val();
-      if (!L->contains(I))
-        continue;
-      if (!visit(I))
-        continue;
-      for (User *U : I->users())
-        Worklist.insert(cast<Instruction>(U));
-    }
+  // Visit all GEPs in the loop L, and find those that after complete loop
+  // unrolling would become constants, or BaseAddress+Constant.  For this to
+  // happen, a GEPs SCEV expression should contain at most one AddRec
+  // expression, and the loop corresponding to this expression should be L. The
+  // rest SCEV sub-expressions should be either constants, or ScevUnknown
+  // (which would become the base address).
+  // If the expression contains the base address, then after subtracting it, we
+  // should get AddRec with constant step and start.
+  void cacheSCEVResults() {
----------------
chandlerc wrote:
> I find it a bit surprising to do this caching up-front. I would expect it to work better to lazily populate the cache based on the expressions we see.
> 
> If that makes sense, I think it also makes sense to factor this into a completely separate change that just wraps up SCEV queries in a lazily populated cache, specifically for the purpose of simulation? Might make sense to fully sink this into SCEV, I don't know and would defer to others that know SCEV better.
Initially, I didn't want to mix working with SCEV with everything else, especially taking into account that this data will be the same across all iterations and will be needed on every iteration. But now I'm thinking that moving this to visitGetElementPtr would simplify the code. I'll do that.

As for the second part: SCEV is already a cache (i.e., a Value --> SCEV expression map). But what we store in our cache is the result of additional computations that are very task-specific (we're looking for GEPs whose SCEV expressions contain at most one SCEVAddRec, and that AddRec must relate to the particular loop). These computations seem expensive enough to be worth caching their results, but they are not general enough to move into SCEV itself.
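
For concreteness, here is a rough sketch of the kind of per-GEP check described above (the helper name and exact structure are mine, not the patch; it only handles the common {Base [+ Const], +, Const}<L> shape):

    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/Analysis/ScalarEvolution.h"
    #include "llvm/Analysis/ScalarEvolutionExpressions.h"
    #include "llvm/IR/Instructions.h"
    using namespace llvm;

    // Accept a GEP whose address is {BaseAddress [+ C0], +, C1}<L>, so that on
    // each unrolled iteration it folds to BaseAddress + known constant.
    static bool isSimulatableGEP(GetElementPtrInst *GEP, ScalarEvolution &SE,
                                 const Loop *L, Value *&BaseAddress) {
      if (!SE.isSCEVable(GEP->getType()))
        return false;

      // The whole expression must be a single affine AddRec over exactly L,
      // with a constant step.
      auto *AR = dyn_cast<SCEVAddRecExpr>(SE.getSCEV(GEP));
      if (!AR || AR->getLoop() != L || !AR->isAffine() ||
          !isa<SCEVConstant>(AR->getStepRecurrence(SE)))
        return false;

      // The start must be the base address (a SCEVUnknown), optionally plus a
      // constant offset; SCEV canonicalizes the constant as the first operand.
      const SCEV *Start = AR->getStart();
      if (auto *Add = dyn_cast<SCEVAddExpr>(Start)) {
        if (Add->getNumOperands() != 2 ||
            !isa<SCEVConstant>(Add->getOperand(0)))
          return false;
        Start = Add->getOperand(1);
      }
      auto *Base = dyn_cast<SCEVUnknown>(Start);
      if (!Base)
        return false;
      BaseAddress = Base->getValue();
      return true;
    }

The cache would then just memoize the per-GEP result (base address plus constant start/step) so it isn't recomputed on every simulated iteration.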

http://reviews.llvm.org/D7837
