[PATCH] Reimplement heuristic for estimating complete-unroll optimization effects.

Thu Apr 9 13:54:00 PDT 2015

Ok, full pass of comments below.

Generally, this is definitely looking better to me. I think there is still a number of things that could be simplified or refactored, but that can come later. The stuff below is just to try and get this iteration ready to go.


================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:254-255
@@ +253,4 @@
+
+  /// \brief Used for filtering out SCEV expressions with two or more AddRec
+  ///subexpressions.
+  ///
----------------
Indentation seems off here..

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:257-261
@@ +256,7 @@
+  ///
+  /// haveSeenAR is used to filter out complicated SCEV expressions, having
+  /// several AddRec sub-expressions. We don't handle them, because unrolling one
+  /// loop wouldn't help to replace only one of these inductions with a constant,
+  /// and consequently, the expression would remain non-constant.
+  bool haveSeenAR;
+
----------------
No need to repeat the variable name.

I'd also call the variable "HaveSeenAR" for consistency with LLVM's naming conventions.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:272-274
@@ -271,3 +271,5 @@
+
   FindConstantPointers(const Loop *loop, ScalarEvolution &SE)
-      : LoadCanBeConstantFolded(true), IndexIsConstant(true), L(loop), SE(SE) {}
+      : IndexIsConstant(true), haveSeenAR(false), BaseAddress(nullptr),
+        L(loop), SE(SE) {}
 
----------------
While you're here, can you make the argument be 'L' instead of 'loop'?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:284-285
@@ -279,1 +283,4 @@
+      // - it's too complicated.
+      if (BaseAddress)
+        return IndexIsConstant = false;
       BaseAddress = SC->getValue();
----------------
I somewhat dislike an assignment in a return expression. It's really hard to see when reading the code. Could you instead set the variable and then return? Or maybe use a function that aborts the traversal?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:287
@@ -279,5 +286,3 @@
       BaseAddress = SC->getValue();
-      LoadCanBeConstantFolded =
-          IndexIsConstant && isLoadFromConstantInitializer(BaseAddress);
       return false;
     }
----------------
This is the only raw false here. Everything else that returns false sets IndexIsConstant to false first. If this is correct, could you comment on why?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:363
@@ +362,3 @@
+  // unrolling, save the corresponding data.
+  DenseMap<Value *, SCEVGEPDescriptor> SCEVCache;
+
----------------
SmallDenseMap?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:413-415
@@ +412,5 @@
+  bool visitLoad(LoadInst &I) {
+    Value *AddrOp = I.getPointerOperand();
+    if (SimplifiedValues[AddrOp])
+      AddrOp = SimplifiedValues[AddrOp];
+    if (!SCEVCache.count(AddrOp))
----------------
Please don't use DenseMap like this. You're inserting a null value for every AddrOp you visit.

You want to use exactly the logic used above for the LHS and RHS expresisons:

  Value *AddrOp = I.getPointerOperand();
  if (!isa<Constant>(AddrOp))
    if (Constant *SimplifiedAddrOp = SimplifiedValues.lookup(AddrOp))
      AddrOp = SimplifiedAddrOp;

Better yet, just hoist this entire pattern into a helper called 'simplifyValue' or some such?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:416-418
@@ -383,1 +415,5 @@
+      AddrOp = SimplifiedValues[AddrOp];
+    if (!SCEVCache.count(AddrOp))
+      return false;
+    SCEVGEPDescriptor d = SCEVCache[AddrOp];
 
----------------
Don't do two map lookups. Lookup the key once (using find because you don't just get a null pointer) and then test the iterator and use the iterator.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:420
@@ -383,4 +419,3 @@
 
-    auto GV = dyn_cast<GlobalVariable>(BaseAddr);
-    if (!GV)
-      return nullptr;
+    auto GV = dyn_cast_or_null<GlobalVariable>(d.BaseAddr);
+    if (!GV || !GV->hasInitializer())
----------------
In what case is the base address null? It doesn't really make sense to add null base addr records to the cache to me?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:429-430
@@ -407,6 +428,4 @@
 
-    if (const SCEVConstant *StartSE = dyn_cast<SCEVConstant>(AR->getStart()))
-      StartC = StartSE->getValue()->getValue();
-    else
-      return nullptr;
+    unsigned Start = d.Start.getLimitedValue();
+    unsigned Step = d.Step.getLimitedValue();
 
----------------
What ensures that this is always safe?

If something does ensure that this is always safe, I must assume it is when populating the structure. If that's the case, why can't we only store the unsigned variant?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:433
@@ -413,6 +432,3 @@
     unsigned ElemSize = CDS->getElementType()->getPrimitiveSizeInBits() / 8U;
-    unsigned Start = StartC.getLimitedValue();
-    unsigned Step = StepC.getLimitedValue();
-
     unsigned Index = (Start + Step * Iteration) / ElemSize;
     if (Index >= CDS->getNumElements())
----------------
What happens when "Step * Iteration" overflows? What about when "(Start + Step * Iteration)" overflows?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:437-441
@@ -420,2 +436,7 @@
 
     Constant *CV = CDS->getElementAsConstant(Index);
+    if (CV)
+      SimplifiedValues[cast<Value>(&I)] = CV;
+
+    NumberOfOptimizedInstructions += TTI.getUserCost(&I);
+    return true;
----------------
Under what circumstances is this not a constant?

If this is genuinely not a constant, should we really be considering the load "optimized"?


(Also, the cast<Value> shouldn't be needed?)

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:449-454
@@ +448,8 @@
+  // constant later.
+  // SCEV expression of such GEP should contain at most one AddRec expression,
+  // and the loop corresponding to this expression should be L. The rest SCEV
+  // sub-expressions should be either constants, or ScevUnknown (which would
+  // become the base address).  If the expression contains the base address,
+  // then after subtracting it, we should get AddRec with constant step and
+  // start.
+  bool visitGetElementPtr(GetElementPtrInst &I) {
----------------
Would this second comment paragraph go better on the SCEV evaluation tool above? Or sunk into the implementation below? It seems kind of out of place here.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:456
@@ +455,3 @@
+  bool visitGetElementPtr(GetElementPtrInst &I) {
+    Value *V = cast<Value>(&I);
+    if (!VisitedGEPs.insert(V).second)
----------------
Shouldn't you rinse this through the simplification map?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:457-458
@@ +456,4 @@
+    Value *V = cast<Value>(&I);
+    if (!VisitedGEPs.insert(V).second)
+      return false;
+
----------------
You should probably comment specifically that we expect to re-visit the same instructions repeatedly (once per iteration) and so we only want to do iteration-independent SCEV queries and computations once. I'd also probably extract all of the SCEV computation stuff below into a separate member function that you can comment as being iteration independent, etc. Then you can structure the visit somewhat more naturally.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:489
@@ -424,1 +488,3 @@
+    SCEVCache[V] = d;
+    return false;
   }
----------------
If the base address is a constant the GEP will also fold away though, so we should be able to mark it as optimized? (and we should be able to do this on each iteration)

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:498-499
@@ -454,18 +497,4 @@
 
-  // Given a list of loads that could be constant-folded (LoadBaseAddresses),
-  // estimate number of optimized instructions after substituting the concrete
-  // values for the given Iteration. Also track how many instructions become
-  // dead through this process.
-  unsigned estimateNumberOfOptimizedInstructions(unsigned Iteration) {
-    // We keep a set vector for the worklist so that we don't wast space in the
-    // worklist queuing up the same instruction repeatedly. This can happen due
-    // to multiple operands being the same instruction or due to the same
-    // instruction being an operand of lots of things that end up dead or
-    // simplified.
-    SmallSetVector<Instruction *, 8> Worklist;
-
-    // Clear the simplified values and counts for this iteration.
-    SimplifiedValues.clear();
-    CountedInstructions.clear();
-    NumberOfOptimizedInstructions = 0;
+  /// \brief Count the number of optimized instructions.
+  unsigned NumberOfOptimizedInstructions;
 
----------------
I find it really weird to count optimized instructions rather than counting instructions that will remain *after* optimizations.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:504-510
@@ +503,9 @@
+
+  // Complete loop unrolling can make some loads constant, and we need to know
+  // if that would expose any further optimization opportunities.  This routine
+  // estimates this optimization.  It assigns computed number of instructions,
+  // that potentially might be optimized away, to NumberOfOptimizedInstructions,
+  // and total number of instructions to UnrolledLoopSize (not counting blocks
+  // that won't be reached, if we were able to compute the condition).
+  void analyzeLoop() {
+    BBSetVector BBWorklist;
----------------
Rather than setting UnrolledLoopSize to UINT_MAX below to signal some inability to reasonably compute the unrolled size estimation, why not return true or false here? If this returns false, we have no useful data about the loop. Move on. If this returns true, then you can query for the detailed numbers.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:515-520
@@ -478,5 +514,8 @@
 
-      for (User *U : LI->users())
-        Worklist.insert(cast<Instruction>(U));
+    // Don't simulate loops with a big or unknown tripcount
+    if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||
+        TripCount > UnrollMaxIterationsCountToAnalyze) {
+      UnrolledLoopSize = UINT_MAX;
+      return;
     }
 
----------------
I think you should have a FIXME to eventually remove the max iteration count to analyze. Once we shake the bugs out of the algorithm, it shouldn't be necessary. We should be willing to analyze any number of iterations as long as the un-optimized resulting instruction count is below a threshold.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:562-566
@@ -543,1 +561,7 @@
+
+    // If we can overflow computing percentage of optimized instructions, just
+    // give a conservative answer. Anyway, we don't want to deal with such a
+    // big loops.
+    if (NumberOfOptimizedInstructions > UINT_MAX / 100)
+      NumberOfOptimizedInstructions = 0;
   }
----------------
I would handle all of this below where you're actually dealing with percentages. Handling it here seems really surprising and hard to understand.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:690-703
@@ +689,16 @@
+
+  unsigned PercentOfOptimizedInstructions =
+      NumberOfOptimizedInstructions * 100 /
+      UnrolledSize; // The previous check guards us from div by 0
+  if (UnrolledSize <= AbsoluteThreshold &&
+      PercentOfOptimizedInstructions >= PercentOfOptimizedForCompleteUnroll) {
+    DEBUG(dbgs() << "  Can fully unroll, because unrolling will help removing "
+                 << PercentOfOptimizedInstructions
+                 << "% instructions (threshold: "
+                 << PercentOfOptimizedForCompleteUnroll << "%)\n");
+    DEBUG(dbgs() << "  Unrolled size (" << UnrolledSize
+                 << ") is less than the threshold (" << AbsoluteThreshold
+                 << ").\n");
+    return true;
+  }
+
----------------
Can you explain some of your motivations for having the double threshold and percentage query? It seems really awkward to implement, and so I'm curious what the need is here. If we could get around with just having a flat threshold, it'd make me happy. =]

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:832-842
@@ -800,10 +831,13 @@
   if (TripCount && Count == TripCount) {
-    if (Threshold != NoThreshold && UnrolledSize > Threshold) {
-      DEBUG(dbgs() << "  Too large to fully unroll with count: " << Count
-                   << " because size: " << UnrolledSize << ">" << Threshold
-                   << "\n");
-      Unrolling = Partial;
-    } else {
+    UnrollAnalyzer UA(L, TripCount, *SE, TTI, AbsoluteThreshold);
+    UA.analyzeLoop();
+    if (canUnrollCompletely(
+            L, Threshold, AbsoluteThreshold,
+            std::min<unsigned>(UnrolledSize, UA.UnrolledLoopSize),
+            UA.NumberOfOptimizedInstructions,
+            PercentOfOptimizedForCompleteUnroll)) {
       Unrolling = Full;
+    } else {
+      Unrolling = Partial;
     }
   } else if (TripCount && Count < TripCount) {
----------------
I may just be forgetting where this is handled, but do we somewhere short-circuit the case where the total size of the loop body * the trip count is already below the threshold? Because we should. We shouldn't go and do the expensive analysis unless we at least have a large enough loop to be on the fence.

http://reviews.llvm.org/D8816

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/