[PATCH] D10717: Enhance loop rotation with existence of profile data in MachineBlockPlacement pass.

Sun Jul 26 23:24:44 PDT 2015

congh added inline comments.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:816-827
@@ +815,14 @@
+///
+/// With profile data, we can determine the benefit in terms of fall through
+/// opportunities when rotating a loop chain and select the best rotation.
+/// Basically, there are three kinds of benefit/cost to consider for each
+/// rotation:
+///    1. The fall through edge (if it exists) from BB out of the loop to the
+///    loop header (this is a benefit).
+///    2. The fall through edge (if it exists) from the loop exit to BB out of
+///    the loop (this is a benefit).
+///    3. The fall through edge (if it exists) from the last BB to the first BB
+///    in the loop chain (this is a cost as it is not falling through any more).
+///  Therefore, the benefit for a given rotation is 1 + 2 - 3. We select the
+///  best rotation with the largest benefit.
+void MachineBlockPlacement::rotateLoopWithProfile(
----------------
davidxl wrote:
> chandlerc wrote:
> > I think it would be clearer to talk *only* in terms of cost. Below, when computing it, you can add and subtract costs to find the minimum.
> > 
> > Consider wording along the lines of:
> > 
> > There are three costs we want to minimize stemming from a jump rather than a fall through edge:
> > 1) Entering the loop,
> > 2) Exiting the loop, and
> > 3) Continuing to loop by returning to the header.
> > 
> > Does that make sense? If not, then maybe I'm misunderstanding what you're trying to say more fundamentally.
> More precisely, the objective is to find a rotation with minimal overall cost or equivalently maximal the overall benefit. The cost/benefit function has 3 components.
Yes. This makes more sense. But from the implementation aspect, BlockFrequency doesn't provide subtract operation. This is why I use both benefit and cost. I could scale down all frequencies so that I can use signed 64bit integer to calculate the cost or benefit.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:834-838
@@ +833,7 @@
+
+  // We calculate the benefit and cost of a loop rotation respectively, which
+  // are represented by BlockFrequency. When comparing the benefit/cost between
+  // two different rotations, we represent the actual benefit as benefit - cost,
+  // and check (benefit1 - cost1) > (benefit2 - cost2) from (benefit1 + cost2) >
+  // (benefit2 + cost1). Hopefully there is no overflow.
+  BlockFrequency BestRotationBenefit(0);
----------------
davidxl wrote:
> chandlerc wrote:
> > As above, I think it would be useful to have just cost, and to use the natural addition and subtraction to find the minimum cost rotation.
> > 
> > Also, "hopefully there is no overflow" seems a bad sign. We should *definitely* have no overflow, and I think it makes sense to add asserts and such to enforce that.
> Cong is actually computing the overall benefit and maximize it. 
> 
> Cong, It is equivalent to compute cost only -- simply change it to LoopExternalConnectionCost. For rotation with positive benefit in current patch, the connection cost will be 0, while for rotation with no benefit, the connection cost will be the branch freq to the head (or branch out freq)
The arithmetic operation defined in BlockFrequency already considers overflow, but it may return the saturated result, which may not be what we want. I may have to use the similar way that is used in getSumForBlock() to prevent potential overflow.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:834-838
@@ +833,7 @@
+
+  // We calculate the benefit and cost of a loop rotation respectively, which
+  // are represented by BlockFrequency. When comparing the benefit/cost between
+  // two different rotations, we represent the actual benefit as benefit - cost,
+  // and check (benefit1 - cost1) > (benefit2 - cost2) from (benefit1 + cost2) >
+  // (benefit2 + cost1). Hopefully there is no overflow.
+  BlockFrequency BestRotationBenefit(0);
----------------
congh wrote:
> davidxl wrote:
> > chandlerc wrote:
> > > As above, I think it would be useful to have just cost, and to use the natural addition and subtraction to find the minimum cost rotation.
> > > 
> > > Also, "hopefully there is no overflow" seems a bad sign. We should *definitely* have no overflow, and I think it makes sense to add asserts and such to enforce that.
> > Cong is actually computing the overall benefit and maximize it. 
> > 
> > Cong, It is equivalent to compute cost only -- simply change it to LoopExternalConnectionCost. For rotation with positive benefit in current patch, the connection cost will be 0, while for rotation with no benefit, the connection cost will be the branch freq to the head (or branch out freq)
> The arithmetic operation defined in BlockFrequency already considers overflow, but it may return the saturated result, which may not be what we want. I may have to use the similar way that is used in getSumForBlock() to prevent potential overflow.
David, I am not clear how LoopExternalConnectionCost is working: do we only need this cost, or do we have other cost/benefit? I ask this because there may exist more than one rotation that is beneficial.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:847-850
@@ +846,6 @@
+  auto ScaleBlockFrequency = [](BlockFrequency Freq,
+                                unsigned Scale) -> BlockFrequency {
+    if (Scale == 0)
+      return 0;
+    return Freq / BranchProbability(1, Scale);
+  };
----------------
davidxl wrote:
> I believe I raised similar question before (see comment and replies at line 818). I think the problem is that we can not be very sure if this loop's header will or will not be the layout successor of E due to the loop based layout ordering constraint (In fact in your example, unless the branch is highly biased towards LH, it is likely X will be the layout successor of E).
> 
> That is why I proposed at some point that loop rotation be done when base layout is done for the function -- however that may require too big a change.
> 
> It might worth considering obvious beneficial conditional jump here.
Without knowing the layout outside of the loop, we do not know if the loop header will or will not be the layout-successor of its predecessor outside of the loop. However, this is the case when we do loop rotation (unless we change the design as suggested by David). But if a loop has many iterations, the benefit of the loop header's fall-through would be trivial comparing to the cost from the loop chain's tail to its head (similar assumption for the exit fall-through).  If not, the fall-through to the loop header is very important to the overall cost and we'd better hope that the fall-through will be there in the final layout. That is why I think that we should always take this fall-through into consideration.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:902
@@ +901,3 @@
+        LargestExitEdgeProb = std::max(LargestExitEdgeProb,
+                                       MBPI->getEdgeProbability(TailBB, Succ));
+    }
----------------
davidxl wrote:
> chandlerc wrote:
> > Tragically, this loop is still quadratic in complexity.
> > 
> > getEdgeProbability is really inefficient when queried in this way. You can find other code in this file to compute the best successor probability without hitting this quadratic behavior. We should really figure out an API for this in BPI but for now, I would suggest extracting a helper and using it here.
> yes -- can consider computing TailBB's sum weight outside the loop.
Yes, thanks for pointing it out. I will calculate the sum of weight outside of the loop to avoid redundant calculations.

I will consider to design an API in BPI to help us here.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:912-916
@@ +911,7 @@
+    //   JumpInstCost) * tail node frequency.
+    //   2. If the tail node has two successors, then we still get an additional
+    //   jmp instruction. Note that the more frequent executed jmp instruction
+    //   will be put ahead of the other one. Assume the frequency of those two
+    //   branches are x and y (x >= y), then the cost will be (x * MisfetechCost
+    //   + y * JumpInstCost) * tail node frequency.
+    //   3. If the tail node has more than two successors (this rarely happens),
----------------
davidxl wrote:
> chandlerc wrote:
> > What if one of the successors is outside the loop?
> > 
> > (Also, please add a test case covering this?)
> this should not matter.
That would be an exit and the exit fall-through is also considered here. I will add a test case about it (maybe a loop with several exits with different frequencies).

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:915
@@ +914,3 @@
+    //   will be put ahead of the other one. Assume the frequency of those two
+    //   branches are x and y (x >= y), then the cost will be (x * MisfetechCost
+    //   + y * JumpInstCost) * tail node frequency.
----------------
davidxl wrote:
> I don't think this formula is correct. The correct formula is
> 
> (x-x')*MisfetchCost + y*JumpInstrCost
> 
> Where x' is the frequency of the edge from the tail node to the block which is not its layout successor (before the rotation).
We don't have to compare each rotation with the current layout before rotation. We just calculate its own cost. Your formula is equivalent to the current one: you subtract x' * MisfetchCost from every rotation (so the layout before rotation is x' - x' = 0 here).

http://reviews.llvm.org/D10717