[PATCH] D98213: [InlineCost] Enable the cost benefit analysis on FDO

Thu Mar 11 14:49:39 PST 2021

wenlei added a comment.

In D98213#2620263 <https://reviews.llvm.org/D98213#2620263>, @kazu wrote:

> In D98213#2612503 <https://reviews.llvm.org/D98213#2612503>, @wenlei wrote:
>
>> What's the .text size change for your internal benchmark when you turn it on?
>
> For FDO, the size of .text.* increases by 0.03%.  .text.hot and .text shrink by 0.67% and 0.24%, respectively.  The reduction is mostly offset by a 1.25% increase in .text.split.
>
> For CSFDO, the size of .text.* increases by 0.07%.  .text.hot, .text, and .text.split grow by 1.10%, 0.40%, and 2.05%, respectively.
>
> Under the hood, we are inlining very hot functions that were previously too big to be inlined while rejecting marginally hot call sites that were within the size threshold.  The former primarily contributes to the execution performance.  The latter contributes to size reduction with little impact on the execution performance.
>
> Note that you can set `-inline-savings-multiplier` to some positive integer smaller than 8, say 4, to limit inlining of the hot call sites and thus reduce the executable size.  My experiments with our large internal benchmark show that the performance doesn't change much when I tweak `-inline-savings-multiplier`.  If we assume that the ratio of cycle savings to size cost is a reasonable indicator of inlining desirability, then applying a threshold on the ratio means that we always inline the most desirable call sites, and the threshold only controls how much of the long tail we reject.  I think that is why the parameter doesn't affect the performance very much.

Thanks for the size data, it seems well justified by the perf boost. I think the cycle saving estimation using profile makes a lot of sense and makes LLVM's PGO closer to other compilers.
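
To make the ratio argument concrete, here is a minimal sketch of a savings-to-size threshold test. It is not the code in this patch: the `CallSite` fields, the `HOT_THRESHOLD` value, and all the numbers are hypothetical, and the exact form of the inequality is an assumption. It only illustrates that lowering the multiplier trims the long tail while the most desirable call sites stay inlined.

```python
from dataclasses import dataclass


@dataclass
class CallSite:
    name: str
    cycle_savings: int  # hypothetical profile-based estimate of cycles saved
    size_cost: int      # hypothetical estimate of code size added by inlining


# Hypothetical stand-in for a profile-derived hotness threshold.
HOT_THRESHOLD = 40


def should_inline(site: CallSite, multiplier: int) -> bool:
    # Assumed form of the test: cycle_savings / size_cost >= HOT_THRESHOLD / multiplier,
    # cross-multiplied so that it stays in integer arithmetic.
    return site.cycle_savings * multiplier >= HOT_THRESHOLD * site.size_cost


def select(sites: list[CallSite], multiplier: int) -> list[str]:
    # Names of the call sites that pass the ratio test, in input order.
    return [s.name for s in sites if should_inline(s, multiplier)]


SITES = [
    CallSite("very_hot_big", cycle_savings=9000, size_cost=300),  # ratio 30
    CallSite("hot_small", cycle_savings=400, size_cost=40),       # ratio 10
    CallSite("marginal", cycle_savings=180, size_cost=30),        # ratio 6
    CallSite("cold", cycle_savings=10, size_cost=50),             # ratio 0.2
]

if __name__ == "__main__":
    print(select(SITES, multiplier=8))
    print(select(SITES, multiplier=4))
```

With these made-up numbers, going from a multiplier of 8 to 4 only drops the `marginal` call site, which matches the observation that the parameter mostly affects size rather than performance.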

> In D98213#2612503 <https://reviews.llvm.org/D98213#2612503>, @wenlei wrote:
>
>> What is the impact of this switch with sample profile?
>
> About neutral without ThinLTO and a 0.27% improvement on our large internal benchmark with ThinLTO.  This is why I am turning on the switch by default on FDO profile for now but not on sample profile.  The sample profile loader has its own inliner.  If we want to benefit from the idea of looking at the ratio of cycle savings to size cost, then I think we really have to modify the inliner in the sample profile loader.

The much smaller perf boost from sample PGO is interesting. I'm curious how much of that is due to the sample loader's inliner versus simply inferior profile quality. A while ago I changed the sample loader's inliner to look at the same InlineCost, except that the threshold is determined locally inside the sample loader. I think we just need to honor the decision there when it comes from the cost-benefit analysis. I may give it a try and see what we get on our workloads.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D98213