[PATCH] Reimplement heuristic for estimating complete-unroll optimization effects.

Chandler Carruth chandlerc at gmail.com
Mon Feb 23 18:44:56 PST 2015


Initial comments below. I'd also like to see some compile-time impact measurements, but maybe wait until the design settles a bit first.


================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:488-492
@@ +487,7 @@
+    EarlyExitFound = false;
+    // Estimate the number of instructions that could be simplified if we
+    // replace a load with the corresponding constant. Since the same load
+    // will take different values on different iterations, we have to go
+    // through all of the loop's iterations here.
+    for (Iteration = 0; Iteration < TripCount; ++Iteration) {
----------------
This comment seems out of date.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:494-495
@@ +493,4 @@
+    for (Iteration = 0; Iteration < TripCount; ++Iteration) {
+      if (EarlyExitFound)
+        break;
+
----------------
There is essentially no documentation of "EarlyExitFound" or why this behavior is correct.

Generally, I'm worried about all the 'break' and loop control flow here. I think this is going to be a bit easier to understand if our early-stop points just return.
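To illustrate the difference with a generic sketch (the names, costs, and iteration count here are made up for illustration, not the patch's actual code), compare a flag-and-break simulation loop with one whose early-stop points simply return:

```cpp
#include <cassert>
#include <vector>

// Flag-and-break style: the reader has to trace EarlyExitFound across
// two loop levels to see where the simulation actually stops.
static bool simulateWithFlag(const std::vector<int> &Costs, int Budget) {
  bool EarlyExitFound = false;
  int Total = 0;
  for (int Iteration = 0; Iteration < 4; ++Iteration) {
    if (EarlyExitFound)
      break;
    for (int Cost : Costs) {
      Total += Cost;
      if (Total > Budget) {
        EarlyExitFound = true;
        break;
      }
    }
  }
  return !EarlyExitFound;
}

// Early-return style: every stop point exits immediately, so each
// bail-out condition is visible at a glance and no flag is needed.
static bool simulateWithReturn(const std::vector<int> &Costs, int Budget) {
  int Total = 0;
  for (int Iteration = 0; Iteration < 4; ++Iteration)
    for (int Cost : Costs) {
      Total += Cost;
      if (Total > Budget)
        return false;
    }
  return true;
}
```

Both functions compute the same answer; the second just has one fewer piece of state to reason about.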

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:497
@@ +496,3 @@
+
+      BBSetVector BBWorklist;
+      BBWorklist.insert(L->getHeader());
----------------
Rather than declaring this in the iteration loop, please declare it in simulateLoop and clear it on each iteration so that we re-use allocated storage. We're *very* likely to put roughly the same number of BBs on the worklist each trip through.
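As a minimal sketch of the storage-reuse pattern (a plain std::vector stands in for BBSetVector, and the names are hypothetical, not the patch's):

```cpp
#include <cassert>
#include <vector>

// "Block" and std::vector stand in for BasicBlock* and BBSetVector;
// only the allocation-reuse pattern is the point here.
using Block = int;

static unsigned simulateLoop(const std::vector<Block> &HeaderBlocks,
                             unsigned TripCount) {
  unsigned Visited = 0;
  // Declared once, outside the per-iteration loop...
  std::vector<Block> BBWorklist;
  for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {
    // ...and cleared each trip: clear() keeps the allocated capacity,
    // so after the first iteration the push-backs stop reallocating,
    // since roughly the same number of blocks is enqueued every time.
    BBWorklist.clear();
    BBWorklist.insert(BBWorklist.end(), HeaderBlocks.begin(),
                      HeaderBlocks.end());
    while (!BBWorklist.empty()) {
      BBWorklist.pop_back();
      ++Visited;
    }
  }
  return Visited;
}
```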

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:505-512
@@ +504,10 @@
+
+        analyzeBlock(BB);
+
+        // If unrolled body turns out to be too big, bail out.
+        if (UnrolledLoopSize - NumberOfOptimizedInstructions >
+            MaxUnrolledLoopSize)
+          return;
+
+        addBBSuccessors(BB, &BBWorklist);
+      }
----------------
I think it would be clearer to inline the analyzeBlock and addBBSuccessors bodies here... At the least, inlining addBBSuccessors avoids passing the worklist around.
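A generic sketch of the inlined form (toy CFG types, not the pass's actual ones): the whole traversal reads top to bottom, and no helper needs the worklist passed in:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy CFG: Succs[BB] lists the successors of block BB.
static int countReachable(const std::vector<std::vector<int>> &Succs,
                          int Entry) {
  std::set<int> Seen{Entry};
  std::vector<int> Worklist{Entry};
  while (!Worklist.empty()) {
    int BB = Worklist.back();
    Worklist.pop_back();
    // ... per-block analysis would go here ...
    // Successor handling inlined at the use site, instead of calling a
    // separate addBBSuccessors(BB, &Worklist) helper.
    for (int Succ : Succs[BB])
      if (Seen.insert(Succ).second)
        Worklist.push_back(Succ);
  }
  return static_cast<int>(Seen.size());
}
```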

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:515-519
@@ -469,10 +514,7 @@
 
-    // We start by adding all loads to the worklist.
-    for (auto &LoadDescr : LoadBaseAddresses) {
-      LoadInst *LI = LoadDescr.first;
-      SimplifiedValues[LI] = computeLoadValue(LI, Iteration);
-      if (CountedInstructions.insert(LI).second)
-        NumberOfOptimizedInstructions += TTI.getUserCost(LI);
-
-      for (User *U : LI->users())
-        Worklist.insert(cast<Instruction>(U));
+      // If we found no optimization opportunities on the first iteration, we
+      // won't find them on later ones either.
+      if (!NumberOfOptimizedInstructions) {
+        UnrolledLoopSize = UINT_MAX;
+        break;
+      }
----------------
Why not return?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:524-531
@@ -480,13 +523,10 @@
 
-    // And then we try to simplify every user of every instruction from the
-    // worklist. If we do simplify a user, add it to the worklist to process
-    // its users as well.
-    while (!Worklist.empty()) {
-      Instruction *I = Worklist.pop_back_val();
-      if (!L->contains(I))
-        continue;
-      if (!visit(I))
-        continue;
-      for (User *U : I->users())
-        Worklist.insert(cast<Instruction>(U));
-    }
+  // Visit all GEPs in the loop L and find those that, after complete loop
+  // unrolling, would become constants or BaseAddress+Constant. For this to
+  // happen, a GEP's SCEV expression should contain at most one AddRec
+  // expression, and the loop corresponding to this expression should be L. The
+  // remaining SCEV sub-expressions should be either constants or SCEVUnknown
+  // (which would become the base address).
+  // If the expression contains the base address, then after subtracting it we
+  // should get an AddRec with constant step and start.
+  void cacheSCEVResults() {
----------------
I find it a bit surprising to do this caching up-front. I would expect it to work better to lazily populate the cache based on the expressions we see.

If that makes sense, I think it also makes sense to factor this into a completely separate change that just wraps SCEV queries in a lazily populated cache, specifically for the purpose of simulation. It might even make sense to fully sink this into SCEV; I don't know, and would defer to others who know SCEV better.
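A sketch of the lazy-cache idea using only standard containers (the Analyze callback stands in for the real SCEV query; none of these names come from the patch): the analysis runs only for expressions the simulation actually asks about, and only once per expression.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>

// Memoizing wrapper: the analysis runs only the first time a given
// expression is queried; later queries hit the cache.
class LazySCEVCache {
public:
  explicit LazySCEVCache(std::function<long(int)> Fn)
      : Analyze(std::move(Fn)) {}

  long get(int Expr) {
    auto It = Cache.find(Expr);
    if (It != Cache.end())
      return It->second;          // hit: no recomputation
    long Result = Analyze(Expr);  // miss: compute exactly once
    Cache.emplace(Expr, Result);
    return Result;
  }

  // How many distinct expressions have actually been analyzed.
  std::size_t analyzed() const { return Cache.size(); }

private:
  std::function<long(int)> Analyze;
  std::unordered_map<int, long> Cache;
};
```

Unlike an up-front pass over every GEP in the loop, nothing is computed for expressions the simulation never reaches.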

http://reviews.llvm.org/D7837
