[PATCH] D64376: [MBP] Avoid tail duplication if it can't bring benefit

Mon Jul 15 09:47:17 PDT 2019

Carrot added inline comments.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:1123
         continue;
-      return false;
+      Duplicate = false;
+      continue;
----------------
davidxl wrote:
> Early return is still valid here, right?
Because of the following (Succ->succ_size() == 0) check, early return is not valid. It is the exact reason of introducing the variable Duplicate here.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:1137
+  // other machanism to handle different cases.
+  if (Succ->succ_size() == 0)
+    return true;
----------------
davidxl wrote:
> Move the check earlier than the loop.
Because of the previous NumDup check, move this return earlier can cause different result.

================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:1143
+
+  // If the duplication candidate has more unplaced predecessors than
+  // successors, the extra duplication can't bring more fallthrough.
----------------
davidxl wrote:
> This does not seem right.   Dup was Pred1's fall through successor and Succ1 was Dup's fall through (without tailDup). Also assume Pred2 and Pred3 has one single successor (otherwise Dup can not be tailDup ed into them).
> 
> Here, we should look at reduced taken branches, not total fallthrough --- because once Dup is tail duped into Pred2 and Pred3, it is merged with them thus has an implicit fall through there.
> 
> Let the incoming frequency to Dup from Pred1,2,3 are F1, F2, and F3, and assume the Probability from Dup to Succ1 is P(S1).  We also assume path insensitivity here in the calculation.
> 
> Before the tail duplication, the total taken branches are
> 
> T1 = F2 + F3 + (F1+F2+F3)*(1-P(S1))
> 
> After taildup, assuming Succ1 is a fall through of Pred1->Dup chain, and Succ2 is a fall through of Pred2,  then the total taken branches is
> 
> T2 = F1 * (1-(P(S1)) +  F2 * P(S1) + F3
> 
> so the difference is :
> 
> T1 - T2 = 2*(1-P(S1)) * F2 + (1-P(S1)) * F3   which is greater than 0.
> 
Current implementation considers only fallthrough from Dup's predecessor to its successors. It doesn't consider other factors such as taken branches and code size.

If we consider taken branches, in this example:
    * Dup is not duplicated into Pred3, total taken branches through Pred3 is 
                    Freq3 + taken branches to Succ1 or Succ2
    * Dup is duplicated into Pred3, at the end of Pred3 it needs two branches to either Succ1 or Succ2, and the total number of taken branches is 
                    Freq3

So duplicated into Pred3 really reduces taken branches. Similarly if Dup has more predecessors with single successor(Dup), tail duplicated Dup into them always reduces taken branches. Unfortunately Dup may have many predecessors, we have multiple cases with dozens of predecessors. Although duplicated Dup reduces taken branches, it also increases code size, reduces the efficiency of i-cache, and regress the performance finally.

So we need better method to model it. I can't see a simple trade off between code size and fallthroughs(taken branches) here, maybe we can apply machine learning here.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D64376/new/

https://reviews.llvm.org/D64376