[PATCH] D64376: [MBP] Avoid tail duplication if it can't bring benefit

Guozhi Wei via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Jul 18 10:40:40 PDT 2019


Carrot marked 2 inline comments as done.
Carrot added a comment.

In D64376#1581825 <https://reviews.llvm.org/D64376#1581825>, @xbolva00 wrote:

> What about CTMark? SPEC?


I ran spec2006int w/o and w/ this patch, the result is 38.7 vs 38.6. So basically no change.



================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:1143
+
+  // If the duplication candidate has more unplaced predecessors than
+  // successors, the extra duplication can't bring more fallthrough.
----------------
davidxl wrote:
> Carrot wrote:
> > davidxl wrote:
> > > This does not seem right.   Dup was Pred1's fall through successor and Succ1 was Dup's fall through (without tailDup). Also assume Pred2 and Pred3 has one single successor (otherwise Dup can not be tailDup ed into them).
> > > 
> > > Here, we should look at reduced taken branches, not total fallthrough --- because once Dup is tail duped into Pred2 and Pred3, it is merged with them thus has an implicit fall through there.
> > > 
> > > Let the incoming frequency to Dup from Pred1,2,3 are F1, F2, and F3, and assume the Probability from Dup to Succ1 is P(S1).  We also assume path insensitivity here in the calculation.
> > > 
> > > Before the tail duplication, the total taken branches are
> > > 
> > > T1 = F2 + F3 + (F1+F2+F3)*(1-P(S1))
> > > 
> > > After taildup, assuming Succ1 is a fall through of Pred1->Dup chain, and Succ2 is a fall through of Pred2,  then the total taken branches is
> > > 
> > > T2 = F1 * (1-(P(S1)) +  F2 * P(S1) + F3
> > > 
> > > so the difference is :
> > > 
> > > T1 - T2 = 2*(1-P(S1)) * F2 + (1-P(S1)) * F3   which is greater than 0.
> > > 
> > Current implementation considers only fallthrough from Dup's predecessor to its successors. It doesn't consider other factors such as taken branches and code size.
> > 
> > If we consider taken branches, in this example:
> >     * Dup is not duplicated into Pred3, total taken branches through Pred3 is 
> >                     Freq3 + taken branches to Succ1 or Succ2
> >     * Dup is duplicated into Pred3, at the end of Pred3 it needs two branches to either Succ1 or Succ2, and the total number of taken branches is 
> >                     Freq3
> > 
> > So duplicated into Pred3 really reduces taken branches. Similarly if Dup has more predecessors with single successor(Dup), tail duplicated Dup into them always reduces taken branches. Unfortunately Dup may have many predecessors, we have multiple cases with dozens of predecessors. Although duplicated Dup reduces taken branches, it also increases code size, reduces the efficiency of i-cache, and regress the performance finally.
> > 
> > So we need better method to model it. I can't see a simple trade off between code size and fallthroughs(taken branches) here, maybe we can apply machine learning here.
> Some heuristics can be added. For instance of if the branch from Dup to Succ1,2 is highly biased, do not tail duplicate. If Pred3 (assuming it is the coldest) block is very cold, do not tail duplicate. Tunings like this can even be extended to 2 predecessor case -- with a good cost model + profile information, we can avoid bloating icache size.
Unfortunately current integrated tail duplication can only duplicate BB to all its predecessors or none. So we can't consider Pred3 only. But I agree tail duplicate BB into part of its predecessors is valuable, it needs significant changes to current tail duplication framework, I plan to work on it in future.


================
Comment at: lib/CodeGen/MachineBlockPlacement.cpp:1138
+  // fallthrough from predecessors from Succ to its successors. We may need
+  // other machanism to handle different cases.
+  if (Succ->succ_size() == 0)
----------------
davidxl wrote:
> Can you explain this check (succ_size() ==0 )  a little more? 
(Succ->succ_size() == 0) means Succ doesn't have successors, so it is the function exit BB. 
This conditional return is to keep its current behavior, because it can impact too many test cases.
In future when we can duplicate BB into part of its predecessors, the exit BB can also be handled better.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D64376/new/

https://reviews.llvm.org/D64376





More information about the llvm-commits mailing list