[PATCH] D21405: [PGO] IRPGO pre-cleanup pass changes

Fri Jun 17 17:44:28 PDT 2016

eraman added a subscriber: eraman.
eraman added a comment.

In http://reviews.llvm.org/D21405#459392, @xur wrote:

> to vsk:
>
> I did some analysis of the slowdown on bzip2: with the pre-instrumentation inliner we are actually more aggressive in the late simple inline. Here is the related call chain.
>  BZ2_compressBlock() -> sendMTFValues() -> bsW()
>  BZ2_compressBlock calls sendMTFValues() one time (1 call site), and
>  sendMTFValues() has 64 call sites to bsW().
>
> In preinline, we inline sendMTFValues() into BZ2_compressBlock().
>  In simple inline, we inline all 64 calls to bsW() into BZ2_compressBlock().
>
> Without preinline, we inline 2 calls to bsW() in sendMTFValues() and then decide to defer inlining the other calls to bsW(). But somehow we do not inline sendMTFValues() into BZ2_compressBlock().
>
> I have yet to investigate why the deferral decision changed in the simple inliner. I think this is a rare case that we happen to hit.

Rong asked me to look into this. There is a bug in the deferral logic, and if it is fixed the default inliner will also result in a code size increase (and possibly a performance regression).
A brief description of the deferral logic: when we inline a B->C callsite and B has local linkage, we look at all callers of B (say A_i). If the cost of B->C inlining exceeds the delta (threshold - cost) of an A_i->B inlining, the logic checks whether the overall cost of B->C inlining plus all A_i->B inlining is less than the cost of B->C inlining; if so, the B->C inlining is deferred. The idea is that delaying B->C inlining allows B to be inlined into all its callers, so the out-of-line body of B can be removed; C is subsequently inlined into A, resulting in an overall cost (a proxy for code size) reduction. To account for the fact that the body of B could be removed, a negative cost (-15000) is applied.
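
To make the description above concrete, here is a minimal, self-contained sketch of the deferral check. It is a simplified model, not the code in the inliner: `OuterCallSite`, `shouldDefer` and `kBodyRemovalBonus` are hypothetical names, and comparing the summed A_i->B costs (after the bonus) against the B->C cost is one reading of the comparison described above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model of one A_i->B call site; not an LLVM type.
struct OuterCallSite {
  int cost;       // estimated cost of inlining B into A_i
  int threshold;  // inline threshold that applies at this call site
};

// The negative cost (-15000) mentioned above, modeled as a bonus.
constexpr int kBodyRemovalBonus = 15000;

// Returns true when inlining C into B should be deferred.
// innerCost: estimated cost of the B->C inlining under consideration.
// callers:   every A_i->B call site (B is assumed to have local linkage).
bool shouldDefer(int innerCost, const std::vector<OuterCallSite>& callers) {
  assert(innerCost >= 0 && "this sketch assumes a non-negative B->C cost");
  bool blocksSomeOuterInline = false;
  int totalOuterCost = 0;
  for (const OuterCallSite& site : callers) {
    int delta = site.threshold - site.cost;  // headroom left at A_i->B
    if (innerCost > delta) {
      // Inlining C into B would push this A_i->B inline over its threshold.
      blocksSomeOuterInline = true;
      totalOuterCost += site.cost;
    }
  }
  // Assumption: B's out-of-line body can be removed once every caller has
  // inlined it, so the bonus is applied whenever B has at least one caller.
  if (!callers.empty())
    totalOuterCost -= kBodyRemovalBonus;
  return blocksSomeOuterInline && totalOuterCost < innerCost;
}
```

With illustrative numbers, `shouldDefer(200, {{100, 225}})` defers (the headroom of 125 is exceeded and the bonus makes the outer inline look cheap), while `shouldDefer(50, {{100, 225}})` does not, because the B->C inline still fits in the headroom.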

Now, in this case (A: BZ2_compressBlock(), B: sendMTFValues(), C: bsW()), there is only one caller of B (A). When the A->B inline cost is computed, the cost analysis also applies the -15000 cost. In other words, the deferral logic underestimates the cost of A->B inlining by 15000 and, because the cost of A->B + B->C is less than the cost of B->C after this underestimation, defers most B->C callsites. But when we later consider A->B inlining, the cost is higher than the threshold (since we don't apply the -15000 cost twice), and once that fails, the C nodes do not get inlined.

The fix is simple - apply the negative cost correctly - but that will result in all B->C callsites being inlined (and no inlining of the A->B callsite), resulting in a code size regression.
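
The single-caller case can be illustrated with a small model of the two decisions involved. This is only a sketch under stated assumptions: `Outcome` and `decide` are hypothetical names, the numbers in the usage note are made up, and the model ignores how the A->B cost would change after any B->C inlining.

```cpp
#include <cassert>

// Hypothetical result of the two decisions for a B with a single caller A.
struct Outcome {
  bool deferInner;   // the B->C inlining is deferred
  bool inlineOuter;  // A->B is inlined when it is considered later
};

// reportedOuterCost: the A->B cost as returned by the cost analysis, which
//                    (per the comment above) already includes the -15000.
// applyBonusAgain:   true models the buggy deferral check, false the fix.
Outcome decide(int innerCost, int reportedOuterCost, int threshold,
               int bonus, bool applyBonusAgain) {
  assert(bonus >= 0 && "the bonus is modeled as a non-negative amount");
  // The deferral check's view of the A->B cost.
  int deferralEstimate = reportedOuterCost - (applyBonusAgain ? bonus : 0);
  // Would inlining C into B exceed the headroom left at A->B?
  bool blocksOuter = innerCost > threshold - reportedOuterCost;
  Outcome out;
  out.deferInner = blocksOuter && deferralEstimate < innerCost;
  // The later A->B decision only ever sees the bonus once.
  out.inlineOuter = reportedOuterCost < threshold;
  return out;
}
```

For example, `decide(50, 400, 225, 15000, true)` defers B->C but never inlines A->B, which is the behavior described for bzip2. `decide(50, 400, 225, 15000, false)` no longer defers, so every B->C callsite is inlined while A->B still is not.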

http://reviews.llvm.org/D21405