[PATCH] D21405: [PGO] IRPGO pre-cleanup pass changes
Easwaran Raman via llvm-commits
llvm-commits at lists.llvm.org
Fri Jun 17 17:44:28 PDT 2016
eraman added a subscriber: eraman.
eraman added a comment.
In http://reviews.llvm.org/D21405#459392, @xur wrote:
> to vsk:
> I did some analysis on the slow down on bzip2: with preinstrumentation inliner we actually are more aggressive on the late simple inline. Here is the related call chain.
> BZ2_compressBlock() -> sendMTFValues() --> bsW()
> BZ2_compressBlock calls sendMTFValues() one time (1 call site), and
> sendMTFValues() has 64 call sites to bsW().
> In preinline, we inline sendMTFValues() into BZ2_compressBlock().
> In simple inline, we inline all 64 calls to bsW() into BZ2_compressBlock().
> Without preinline, we inline 2 calls to bsW() in sendMTFValues() and then decide to defer inlining of the other calls to bsW(). But somehow we do not inline sendMTFValues() into BZ2_compressBlock().
> I'm yet to investigate why deferred decision changed in simple inliner. I think this is a rare case that we happen to hit.
Rong asked me to look into this. There is a bug in the deferral logic, and once it is fixed the default inliner will also produce a code size increase (and possibly a performance regression).
A brief description of the deferral logic: When we inline a B->C callsite, and B has local linkage, we look at all callers of B (say A_i). If the cost of B->C inlining exceeds the delta (threshold - cost) of an A_i->B inlining, the inliner checks whether the overall cost of B->C inlining plus all A_i->B inlining is less than the cost of B->C inlining, and if so, the B->C inlining is deferred. The idea is that delaying B->C inlining will allow B to be inlined into all its callers, after which the out-of-line body of B can be removed and C will subsequently be inlined into A, resulting in an overall cost (a proxy for code size) reduction. To account for the fact that the body of B could be removed, a negative cost (-15000) is applied.
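The decision described above can be sketched roughly as follows. This is a simplified model, not the code in the LLVM inliner: `CostEstimate`, `kBodyRemovalBonus` and `shouldDeferInnerInline` are hypothetical names, and the sketch assumes the comparison is between the summed A_i->B costs (after the -15000 bonus) and the B->C cost, which is one reading of the description.

```cpp
#include <vector>

// Hypothetical stand-in for one inline-cost result; not LLVM's own class.
struct CostEstimate {
  int cost;       // estimated cost of inlining this callsite
  int threshold;  // threshold the cost is compared against
  int delta() const { return threshold - cost; }
};

// The negative cost from the text, applied because B's body could be removed.
constexpr int kBodyRemovalBonus = 15000;

// Returns true if inlining the B->C callsite should be deferred.
// callersOfB holds one A_i->B estimate per callsite of B.
bool shouldDeferInnerInline(const CostEstimate &bToC,
                            const std::vector<CostEstimate> &callersOfB,
                            bool bHasLocalLinkage) {
  if (!bHasLocalLinkage || callersOfB.empty())
    return false;

  bool blocksSomeOuterInline = false;
  int totalOuterCost = 0;
  for (const CostEstimate &aToB : callersOfB) {
    // Inlining C into B would use up more than the headroom A_i->B has left.
    if (bToC.cost > aToB.delta()) {
      blocksSomeOuterInline = true;
      totalOuterCost += aToB.cost;
    }
  }

  // B is local, so its out-of-line body can go away once all callers inline it.
  totalOuterCost -= kBodyRemovalBonus;

  return blocksSomeOuterInline && totalOuterCost < bToC.cost;
}
```

With a single caller whose delta is smaller than the B->C cost, the bonus dominates the sum and the function returns true, i.e. the B->C inlining is deferred.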
Now, in this case (A: BZ2_compressBlock(), B: sendMTFValues(), C: bsW()), there is only one caller of B (A). When the A->B inline cost is computed, the cost analysis itself already applies the -15000 cost. In other words, the deferral logic underestimates the cost of A->B inlining by 15000, and because the cost of A->B + B->C is less than the cost of B->C after this underestimation, most B->C callsites are deferred. But when we later consider A->B inlining, the cost is higher than the threshold (since we don't apply the -15000 cost twice), and once that fails, the C nodes do not get inlined.
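A small worked example of the double counting, with made-up numbers: the raw cost of 20000, the B->C cost of 100 and the threshold of 225 are purely illustrative and are not measurements from bzip2, and the function names are hypothetical.

```cpp
constexpr int kBonus = 15000;  // the negative cost from the text

// A->B cost as the cost analysis reports it when B is local with a single
// caller: the bonus is already included.
int reportedOuterCost(int rawCost) { return rawCost - kBonus; }

// What the buggy deferral logic compares against the B->C cost:
// the bonus is subtracted a second time.
int deferralOuterCostBuggy(int rawCost) {
  return reportedOuterCost(rawCost) - kBonus;
}

// With the bonus counted only once.
int deferralOuterCostFixed(int rawCost) { return reportedOuterCost(rawCost); }

// Deferral happens when the outer cost looks cheaper than the inner cost.
bool defers(int outerCost, int innerCost) { return outerCost < innerCost; }
```

With a raw A->B cost of 20000 and a B->C cost of 100, the buggy estimate is -10000, so the B->C inlining is deferred. The cost actually used when A->B is considered is 5000, which is above a threshold of 225, so A->B is rejected and the deferred callsites are never inlined. With the bonus counted once, 5000 is not less than 100 and there is no deferral.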
The fix is simple - apply the negative cost correctly - but that will result in all B->C callsites being inlined (and no inlining of the A->B callsite), resulting in a code size regression.