[PATCH] Reimplement heuristic for estimating complete-unroll optimization effects.

Chandler Carruth chandlerc at gmail.com
Mon May 11 15:28:57 PDT 2015


Whew! Back to this at last. Sorry for the huge delay.

Lots of comments below, but I've marked some as good candidates for follow-up patches.

Can you let me know if it makes sense for me to take a look at the DCE stuff or if we should focus on getting this one landed first?


================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:277
@@ -274,3 +276,3 @@
 
   bool follow(const SCEV *S) {
     if (const SCEVUnknown *SC = dyn_cast<SCEVUnknown>(S)) {
----------------
Please add a comment to this function explaining what it's trying to do.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:333-334
@@ -327,1 +332,4 @@
 class UnrollAnalyzer : public InstVisitor<UnrollAnalyzer, bool> {
+  typedef SetVector<BasicBlock *, SmallVector<BasicBlock *, 16>,
+                    SmallPtrSet<BasicBlock *, 16>> BBSetVector;
+
----------------
Rather than all of this, you can just use a SmallSetVector<BasicBlock *, 16>. I would somewhat prefer that to the typedef...

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:339
@@ -330,1 +338,3 @@
 
+  typedef struct {
+    Value *BaseAddr;
----------------
Just use a named inner struct -- typedef-ing structs is only necessary in C.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:341-342
@@ +340,4 @@
+    Value *BaseAddr;
+    uint64_t Start;
+    uint64_t Step;
+  } SCEVGEPDescriptor;
----------------
While 64 bits should be enough for any common case, if the SCEV code has APInts, I would just continue to use them here so that we have full fidelity.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:377-384
@@ -342,12 +376,10 @@
 
   // Provide base case for our instruction visit.
   bool visitInstruction(Instruction &I) { return false; };
-  // TODO: We should also visit ICmp, FCmp, GetElementPtr, Trunc, ZExt, SExt,
-  // FPTrunc, FPExt, FPToUI, FPToSI, UIToFP, SIToFP, BitCast, Select,
-  // ExtractElement, InsertElement, ShuffleVector, ExtractValue, InsertValue.
+  // TODO: Add visitors for other instruction types, e.g. ZExt, SExt.
   //
   // Probaly it's worth to hoist the code for estimating the simplifications
   // effects to a separate class, since we have a very similar code in
   // InlineCost already.
   bool visitBinaryOperator(BinaryOperator &I) {
     Value *LHS = I.getOperand(0), *RHS = I.getOperand(1);
----------------
The spacing and mixture of doxygen style and non-doxygen style seems really messy here.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:419
@@ +418,3 @@
+      return false;
+    SCEVGEPDescriptor d = It->second;
+
----------------
'd' isn't a very helpful name (aside from not matching the variable naming conventions).

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:421-423
@@ -388,1 +420,5 @@
+
+    auto GV = dyn_cast<GlobalVariable>(d.BaseAddr);
+    if (!GV || !GV->hasInitializer())
+      return false;
 
----------------
Comment here to remind the reader that you're checking for the specific types of SCEVGEP loads that can be folded completely to a constant.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:432-433
@@ -419,3 +431,4 @@
+    uint64_t Index = (d.Start + d.Step * Iteration) / ElemSize;
     if (Index >= CDS->getNumElements())
-      return nullptr;
+      return false;
 
----------------
We at least need a comment or FIXME, as we shouldn't return false here. A load past the end of sequential constant data is an error, and so we should be free to fold it to nothing for the purpose of loop unroll cost estimation.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:450
@@ -438,2 +449,3 @@
     for (auto BB : L->getBlocks()) {
       for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ++I) {
+        if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(I)) {
----------------
Why not a range based loop here?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:454-455
@@ -445,2 +453,4 @@
+          const SCEV *S = SE.getSCEV(V);
           FindConstantPointers Visitor(L, SE);
           SCEVTraversal<FindConstantPointers> T(Visitor);
+          // Try to find (BaseAddress+Step+Offset) tuple.
----------------
It feels like we could probably hoist some of this out of the loop? Feel free to just leave a FIXME and not deal with it in this patch.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:463
@@ +462,3 @@
+
+          SCEVGEPDescriptor d;
+          d.BaseAddr = Visitor.BaseAddress;
----------------
Again, 'd' is a bad name here.

Actually, I don't know why you create the descriptor this early? It seems like this region of code could just use the base addr from the visitor.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:480
@@ +479,3 @@
+          APInt StartAP = StartSE->getValue()->getValue();
+          if (StartAP.getActiveBits() > 32 || StepAP.getActiveBits() > 32)
+            continue;
----------------
Is this to prevent overflow of the offset computations later? If so, comment that please. If not, what is it for?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:483-486
@@ -451,1 +482,6 @@
+
+          d.Start = StartAP.getLimitedValue();
+          d.Step = StepAP.getLimitedValue();
+
+          SCEVCache[V] = d;
         }
----------------
I feel like this could just be an insert call? Or if you'd rather, something like:

  SCEVCache[V] = {Visitor.BaseAddress, StartAP.getLimitedValue(),
                  StepAP.getLimitedValue()};

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:504-511
@@ +503,10 @@
+
+  // Complete loop unrolling can make some loads constant, and we need to know
+  // if that would expose any further optimization opportunities.  This routine
+  // estimates this optimization.  It assigns computed number of instructions,
+  // that potentially might be optimized away, to NumberOfOptimizedInstructions,
+  // and total number of instructions to UnrolledLoopSize (not counting blocks
+  // that won't be reached, if we were able to compute the condition).
+  // Return false if we can't analyze the loop, or if we discovered that
+  // unrolling won't give anything. Otherwise, return true.
+  bool analyzeLoop() {
----------------
I feel like this should probably be a doxygen comment.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:514-515
@@ -483,1 +513,4 @@
+    BBSetVector BBWorklist;
+    UnrolledLoopSize = 0;
+    NumberOfOptimizedInstructions = 0;
 
----------------
Why do we zero these here rather than in the constructor?

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:537-538
@@ +536,4 @@
+        BasicBlock *BB = BBWorklist[Idx];
+        if (BB->empty())
+          continue;
+
----------------
This seems vacuous due to the requirement of a terminator...

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:544-549
@@ +543,8 @@
+        for (Instruction &I : *BB) {
+          UnrolledLoopSize += TTI.getUserCost(&I);
+          Base::visit(I);
+          // If unrolled body turns out to be too big, bail out.
+          if (UnrolledLoopSize - NumberOfOptimizedInstructions >
+              MaxUnrolledLoopSize)
+            return false;
+        }
----------------
Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO.

I think I would also find it easier to read this as separately counting the number of instructions that will actually result from unrolling (essentially, the *un*optimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:687-691
@@ +686,7 @@
+
+  // If we can overflow computing percentage of optimized instructions, just
+  // give a conservative answer. Anyway, we don't want to deal with such a
+  // big loops.
+  if (NumberOfOptimizedInstructions > UINT_MAX / 100)
+    NumberOfOptimizedInstructions = 0;
+
----------------
I would find it much more clear to just write the "percentage" check below in a way that wouldn't overflow:

  uint64_t PercentOfOptimizedInstructions =
      (uint64_t)NumberOfOptimizedInstructions * 100ull / UnrolledSize;

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:695
@@ +694,3 @@
+      NumberOfOptimizedInstructions * 100 /
+      UnrolledSize; // The previous check guards us from div by 0
+  if (UnrolledSize <= AbsoluteThreshold &&
----------------
I don't think the comment here is helping. I would just add an assert about it above, after the test.

================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:848
@@ +847,3 @@
+              L, Threshold, AbsoluteThreshold,
+              std::min<uint64_t>(UnrolledSize, UA.UnrolledLoopSize),
+              UA.NumberOfOptimizedInstructions,
----------------
Why do you need this? I'm surprised this isn't just directly using the UA's values?

http://reviews.llvm.org/D8816

More information about the llvm-commits mailing list