[PATCH] Enhance loop rotation with existence of profile data in MachineBlockPlacement pass.

Mon Jun 29 10:53:47 PDT 2015

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:818
@@ +817,3 @@
+  double HeaderFallThruFreq = 0;
+  for (auto *Pred : HeaderBB->predecessors()) {
+    BlockChain *PredChain = BlockToChain[Pred];
----------------
davidxl wrote:
> congh wrote:
> > davidxl wrote:
> > > Why is this loop outside of the following loop? I expect  each new candidate loop header to have a chance to check the fall through opportunity and related saving/cost.
> > Here we only consider natural loops, which have only one entry (loop header). As there is only one loop header, we only need to compute the fall-through once. We only check this benefit if the loop top is the loop header.
> Ok -- a comment there about that will be useful.
OK. Thanks!

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:821
@@ +820,3 @@
+    if (!LoopBlockSet.count(Pred) &&
+        (!PredChain || Pred == *std::prev(PredChain->end()))) {
+      double EdgeFreq = static_cast<double>(
----------------
davidxl wrote:
> congh wrote:
> > davidxl wrote:
> > > If the previous chain does not end up connecting with this new top block in the current loop chain, then the benefit of the fall through won't be materializable, for instance when the Pred has more than 1 successors and the other successor is also a chain lead.
> > That is right. But at this point we haven't started to build chains outside of this loop. It is difficult to tell if Pred will or will not be connected to the loop header. This is just a heuristic-like optimization. We could not guarantee global optimal result.
> Is it worthwhile considering doing loop rotation after all chains have been formed?
If all chains are already formed, loop rotation may not be done as BBs outside of the loop are already merged into chains. For example, BBs just after loops' exits may not be chain tops. I thinks LLVM's bottom-up method makes sense here.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:828
@@ +827,3 @@
+      if (Pred->succ_size() == 1)
+        EdgeFreq *= 2;
+      HeaderFallThruFreq = std::max(HeaderFallThruFreq, EdgeFreq);
----------------
davidxl wrote:
> congh wrote:
> > davidxl wrote:
> > > I think the adjustment should be done the other way:
> > > 
> > > when num of successors > 1, the cost will be adjust to the minimal freq of the Pred's outgoing edge -- that is where the unconditional jump will be inserted (assuming the fall through path is the hottest one)
> > Consider a two-way branch from Pred with frequencies x and y, and x is the frequency from Pred to the loop header. When x > y, and the edge is not fall-thru, then the number of jumps is x + y + y, and branch cost is x + y; if it is a fall-thru, the number of jumps is x + y, and the branch cost is y. If we take the weight of branch cost and jump instruction to be the same, then we can save x+y by making that edge fall-thru.
> > 
> > If x < y, the edge from Pred to the loop header won't be fall-thru and we save nothing.
> > 
> > So the conclusion is: if num of successors > 1 and the edge from Pred to the loop header is the most frequent one, then making this edge fall-thru can save us the freq of Pred. If that edge is not the most frequent one, we could save nothing. If num of successors == 1, we can save 2x freq of Pred.
> case 1:  num_successors (Pred) == 2, and x > y
> 
> if Pred->Header edge is not the fall through, assuming the other edge will become a fall through edge , then the costs will be:
>   1) number of jumps executed: (x+y), cost of jump is C1
>   2) number of taken branches: x, cost of branch taken is C2:
> Total cost: C1* (x+y) + C2* x
> 
> if Pred->Header edge is the fall through, 
>    1) number of jumps executed: (x+y)
>    2) number of taken branches: y
> Total cost: C1 * (x+y) + C2 * y
> 
> so the saving (of keeping the Pred->header fall through) is C2 * (x - y)
> 
> case 2: number of successors (Pred) == 2, x < y
> 
> The saving of keeping Pred->header to be fall through is still C2 * (x - y ), but negative.
> 
> case 3: number of successors (Pred) == 1
> 
> Keeping the fall through, the total cost is 0, while not keeping the fall through the cost is (C1+C2)*x
> 
> 
> Conclusion
> a) when number of successors > 1, the real cost of not doing the fall through should be adjusted to C2 * (x-y)
> b) when the number of successors == 1, your estimate is correct, assuming C1 == C2
> c) better introduce a cost parameter for type1 and type2
Note that here we focus on loop rotation. In the case of x > y, the other edge with frequency y won't become a fall through edge. If the header is not the chain top, then Pred will chose another loop BB other than the header or the other successor outside of the loop as its successor in the layout. That is why the number of jumps executed is x + y + y (in this case there both of its CFG successors are not fall-thru so there are two jumps).

Following this analysis (as in my last post), we could see that we can save (C1 + C2) * y by making the header as the top of the loop chain.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:855
@@ +854,3 @@
+      if (!LoopBlockSet.count(Succ) &&
+          (!SuccChain || Succ == *SuccChain->begin())) {
+        auto EdgeFreq =
----------------
davidxl wrote:
> congh wrote:
> > davidxl wrote:
> > > Do you also want to check if the Succ block has more than one incoming edges. If it has more than one incoming edges, need to make sure 'Succ' block's  other predecessor is not a tail of another chain that can be connnected with 'Succ'.
> > The function findBestLoopExit() doesn't consider this possibility. If we check this, should we go ahead and check if the source of the more frequent incoming edge has a more frequent outgoing edge... I am afraid this may make things too complicated.
> you need keep an eye on it and consider not doing loop rotation on the fly and chain merging on the fly. The decision of rotation and how chains are merged/connected need to be made together.
In LLVM' current bottom-up algorithm, those two processes are separated by design. Even we delay the loop rotation until forming chains outside of it, if it has several exits, the situation is still complicated (single-entry single-exit loops are much easier to handle as they can be reduced to one CFG node). The loop rotation algorithm is based on the heuristic that the loop exit at the chain bottom are much likely to have a fall-thru edge to the outside.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:870
@@ +869,3 @@
+      // consider the cost of this jump.
+      Cost += (TailBB->succ_size() == 1 ? 2 : 1) *
+              static_cast<double>(FallThruEdgeFreq.getFrequency());
----------------
davidxl wrote:
> This needs more explanation.
Done.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:941
@@ +940,3 @@
+    rotateLoopWithProfile(LoopChain, L, LoopBlockSet);
+  else if (LoopTop == L.getHeader()) {
+    // If we selected just the header for the loop top, look for a potentially
----------------
davidxl wrote:
> congh wrote:
> > davidxl wrote:
> > > the check does not seem necessary. If LoopTop == header, ExitingBB will be null.
> > This is because I changed the position of findBestLoopExit(). As I will resume its position, I will also update this part.
> Another question is: if the current layout algorithm finds a different 'best' loop top than the header, than the profile driven/cost based rotation will be suppressed, which is not desirable.
Actually, if we use the profile based loop rotation algorithm, we could discard the current method to find the  best loop top and exit. This is because our algorithm already considers them.

http://reviews.llvm.org/D10717

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/