[PATCH] Reimplement heuristic for estimating complete-unroll optimization effects.

Fri Apr 10 21:42:36 PDT 2015

Hi Chandler,
Please find my answers inline. The patch is updated correspondingly, is it ok to commit it?


================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:254-255
@@ +253,4 @@
+
+  /// \brief Used for filtering out SCEV expressions with two or more AddRec
+  ///subexpressions.
+  ///
----------------
chandlerc wrote:
> Indentation seems off here..
Fixed.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:257-261
@@ +256,7 @@
+  ///
+  /// haveSeenAR is used to filter out complicated SCEV expressions, having
+  /// several AddRec sub-expressions. We don't handle them, because unrolling one
+  /// loop wouldn't help to replace only one of these inductions with a constant,
+  /// and consequently, the expression would remain non-constant.
+  bool haveSeenAR;
+
----------------
chandlerc wrote:
> No need to repeat the variable name.
> 
> I'd also call the variable "HaveSeenAR" for consistency with LLVM's naming conventions.
Fixed.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:272-274
@@ -271,3 +271,5 @@
+
   FindConstantPointers(const Loop *loop, ScalarEvolution &SE)
-      : LoadCanBeConstantFolded(true), IndexIsConstant(true), L(loop), SE(SE) {}
+      : IndexIsConstant(true), haveSeenAR(false), BaseAddress(nullptr),
+        L(loop), SE(SE) {}
 
----------------
chandlerc wrote:
> While you're here, can you make the argument be 'L' instead of 'loop'?
Fixed.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:284-285
@@ -279,1 +283,4 @@
+      // - it's too complicated.
+      if (BaseAddress)
+        return IndexIsConstant = false;
       BaseAddress = SC->getValue();
----------------
chandlerc wrote:
> I somewhat dislike an assignment in a return expression. It's really hard to see when reading the code. Could you instead set the variable and then return? Or maybe use a function that aborts the traversal?
Fixed.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:287
@@ -279,5 +286,3 @@
       BaseAddress = SC->getValue();
-      LoadCanBeConstantFolded =
-          IndexIsConstant && isLoadFromConstantInitializer(BaseAddress);
       return false;
     }
----------------
chandlerc wrote:
> This is the only raw false here. Everything else that returns false sets IndexIsConstant to false first. If this is correct, could you comment on why?
I rewrote return statements in the function, now they use raw true/false values. I also return `false` instead of `true` in `if (isa<SCEVConstant>(S))` - that doesn't matter since we can't "follow" into the SCEVConstant anyway, but `false` is more consistent here.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:363
@@ +362,3 @@
+  // unrolling, save the corresponding data.
+  DenseMap<Value *, SCEVGEPDescriptor> SCEVCache;
+
----------------
chandlerc wrote:
> SmallDenseMap?
Makes sense, fixed.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:413-415
@@ +412,5 @@
+  bool visitLoad(LoadInst &I) {
+    Value *AddrOp = I.getPointerOperand();
+    if (SimplifiedValues[AddrOp])
+      AddrOp = SimplifiedValues[AddrOp];
+    if (!SCEVCache.count(AddrOp))
----------------
chandlerc wrote:
> Please don't use DenseMap like this. You're inserting a null value for every AddrOp you visit.
> 
> You want to use exactly the logic used above for the LHS and RHS expresisons:
> 
>   Value *AddrOp = I.getPointerOperand();
>   if (!isa<Constant>(AddrOp))
>     if (Constant *SimplifiedAddrOp = SimplifiedValues.lookup(AddrOp))
>       AddrOp = SimplifiedAddrOp;
> 
> Better yet, just hoist this entire pattern into a helper called 'simplifyValue' or some such?
Thanks, fixed! Though I didn't add a separate  function for it.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:416-418
@@ -383,1 +415,5 @@
+      AddrOp = SimplifiedValues[AddrOp];
+    if (!SCEVCache.count(AddrOp))
+      return false;
+    SCEVGEPDescriptor d = SCEVCache[AddrOp];
 
----------------
chandlerc wrote:
> Don't do two map lookups. Lookup the key once (using find because you don't just get a null pointer) and then test the iterator and use the iterator.
Fixed.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:420
@@ -383,4 +419,3 @@
 
-    auto GV = dyn_cast<GlobalVariable>(BaseAddr);
-    if (!GV)
-      return nullptr;
+    auto GV = dyn_cast_or_null<GlobalVariable>(d.BaseAddr);
+    if (!GV || !GV->hasInitializer())
----------------
chandlerc wrote:
> In what case is the base address null? It doesn't really make sense to add null base addr records to the cache to me?
You are right, it can't be null (it's checked when we prepare a new entry to SCEVCache). Fixed.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:429-430
@@ -407,6 +428,4 @@
 
-    if (const SCEVConstant *StartSE = dyn_cast<SCEVConstant>(AR->getStart()))
-      StartC = StartSE->getValue()->getValue();
-    else
-      return nullptr;
+    unsigned Start = d.Start.getLimitedValue();
+    unsigned Step = d.Step.getLimitedValue();
 
----------------
chandlerc wrote:
> What ensures that this is always safe?
> 
> If something does ensure that this is always safe, I must assume it is when populating the structure. If that's the case, why can't we only store the unsigned variant?
Thanks, I added constraints on the operands. Also, we now store unsigned invariant instead of APInt, as you suggested.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:433
@@ -413,6 +432,3 @@
     unsigned ElemSize = CDS->getElementType()->getPrimitiveSizeInBits() / 8U;
-    unsigned Start = StartC.getLimitedValue();
-    unsigned Step = StepC.getLimitedValue();
-
     unsigned Index = (Start + Step * Iteration) / ElemSize;
     if (Index >= CDS->getNumElements())
----------------
chandlerc wrote:
> What happens when "Step * Iteration" overflows? What about when "(Start + Step * Iteration)" overflows?
I made Index, Step, and Start uint64_t, while value in Start, Step and Iteration can't exceed 32-bit maximum. That should prevent overflows.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:437-441
@@ -420,2 +436,7 @@
 
     Constant *CV = CDS->getElementAsConstant(Index);
+    if (CV)
+      SimplifiedValues[cast<Value>(&I)] = CV;
+
+    NumberOfOptimizedInstructions += TTI.getUserCost(&I);
+    return true;
----------------
chandlerc wrote:
> Under what circumstances is this not a constant?
> 
> If this is genuinely not a constant, should we really be considering the load "optimized"?
> 
> 
> (Also, the cast<Value> shouldn't be needed?)
Thanks, fixed!

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:457-458
@@ +456,4 @@
+    Value *V = cast<Value>(&I);
+    if (!VisitedGEPs.insert(V).second)
+      return false;
+
----------------
chandlerc wrote:
> You should probably comment specifically that we expect to re-visit the same instructions repeatedly (once per iteration) and so we only want to do iteration-independent SCEV queries and computations once. I'd also probably extract all of the SCEV computation stuff below into a separate member function that you can comment as being iteration independent, etc. Then you can structure the visit somewhat more naturally.
I think that we want to return to the original `cacheSCEVResults` approach. With that, we'd explicitly distinguish actions we want to do once (compute SCEVs and store interesting ones) from actions we want to perform on every simulated loop iteration (like trying to optimize LoadInst/BinaryOp/etc.). With that, added `cacheSCEVResults` and removed `visitGetElementPtr`.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:489
@@ -424,1 +488,3 @@
+    SCEVCache[V] = d;
+    return false;
   }
----------------
chandlerc wrote:
> If the base address is a constant the GEP will also fold away though, so we should be able to mark it as optimized? (and we should be able to do this on each iteration)
We don't support such optimization for GEPS (for now). We can add it later, and it'll naturally go to `visitGetElementPtr`, which is currently removed.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:498-499
@@ -454,18 +497,4 @@
 
-  // Given a list of loads that could be constant-folded (LoadBaseAddresses),
-  // estimate number of optimized instructions after substituting the concrete
-  // values for the given Iteration. Also track how many instructions become
-  // dead through this process.
-  unsigned estimateNumberOfOptimizedInstructions(unsigned Iteration) {
-    // We keep a set vector for the worklist so that we don't wast space in the
-    // worklist queuing up the same instruction repeatedly. This can happen due
-    // to multiple operands being the same instruction or due to the same
-    // instruction being an operand of lots of things that end up dead or
-    // simplified.
-    SmallSetVector<Instruction *, 8> Worklist;
-
-    // Clear the simplified values and counts for this iteration.
-    SimplifiedValues.clear();
-    CountedInstructions.clear();
-    NumberOfOptimizedInstructions = 0;
+  /// \brief Count the number of optimized instructions.
+  unsigned NumberOfOptimizedInstructions;
 
----------------
chandlerc wrote:
> I find it really weird to count optimized instructions rather than counting instructions that will remain *after* optimizations.
I can change it, but in fact it doesn't sound so weird to me:)

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:504-510
@@ +503,9 @@
+
+  // Complete loop unrolling can make some loads constant, and we need to know
+  // if that would expose any further optimization opportunities.  This routine
+  // estimates this optimization.  It assigns computed number of instructions,
+  // that potentially might be optimized away, to NumberOfOptimizedInstructions,
+  // and total number of instructions to UnrolledLoopSize (not counting blocks
+  // that won't be reached, if we were able to compute the condition).
+  void analyzeLoop() {
+    BBSetVector BBWorklist;
----------------
chandlerc wrote:
> Rather than setting UnrolledLoopSize to UINT_MAX below to signal some inability to reasonably compute the unrolled size estimation, why not return true or false here? If this returns false, we have no useful data about the loop. Move on. If this returns true, then you can query for the detailed numbers.
That's a good idea, fixed according to your suggestion!

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:515-520
@@ -478,5 +514,8 @@
 
-      for (User *U : LI->users())
-        Worklist.insert(cast<Instruction>(U));
+    // Don't simulate loops with a big or unknown tripcount
+    if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||
+        TripCount > UnrollMaxIterationsCountToAnalyze) {
+      UnrolledLoopSize = UINT_MAX;
+      return;
     }
 
----------------
chandlerc wrote:
> I think you should have a FIXME to eventually remove the max iteration count to analyze. Once we shake the bugs out of the algorithm, it shouldn't be necessary. We should be willing to analyze any number of iterations as long as the un-optimized resulting instruction count is below a threshold.
Is it possible that we can optimize all instructions in the loop body, and thus don't reach the threshold? I think it isn't, because we would have at least one instruction (branch) in the loop body, but I'm not confident here - maybe there could be some weird cases (i.e. cost of the branch is 0). If it's guaranteed that cost of the loop body is always > 0, then we can remove this limit.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:562-566
@@ -543,1 +561,7 @@
+
+    // If we can overflow computing percentage of optimized instructions, just
+    // give a conservative answer. Anyway, we don't want to deal with such a
+    // big loops.
+    if (NumberOfOptimizedInstructions > UINT_MAX / 100)
+      NumberOfOptimizedInstructions = 0;
   }
----------------
chandlerc wrote:
> I would handle all of this below where you're actually dealing with percentages. Handling it here seems really surprising and hard to understand.
Fixed!

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:832-842
@@ -800,10 +831,13 @@
   if (TripCount && Count == TripCount) {
-    if (Threshold != NoThreshold && UnrolledSize > Threshold) {
-      DEBUG(dbgs() << "  Too large to fully unroll with count: " << Count
-                   << " because size: " << UnrolledSize << ">" << Threshold
-                   << "\n");
-      Unrolling = Partial;
-    } else {
+    UnrollAnalyzer UA(L, TripCount, *SE, TTI, AbsoluteThreshold);
+    UA.analyzeLoop();
+    if (canUnrollCompletely(
+            L, Threshold, AbsoluteThreshold,
+            std::min<unsigned>(UnrolledSize, UA.UnrolledLoopSize),
+            UA.NumberOfOptimizedInstructions,
+            PercentOfOptimizedForCompleteUnroll)) {
       Unrolling = Full;
+    } else {
+      Unrolling = Partial;
     }
   } else if (TripCount && Count < TripCount) {
----------------
chandlerc wrote:
> I may just be forgetting where this is handled, but do we somewhere short-circuit the case where the total size of the loop body * the trip count is already below the threshold? Because we should. We shouldn't go and do the expensive analysis unless we at least have a large enough loop to be on the fence.
Good point, fixed!

http://reviews.llvm.org/D8816

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/