[PATCH] D57805: [HotColdSplit] Move splitting after instrumented PGO use

Mon Feb 11 12:37:21 PST 2019

vsk added a comment.

In D57805#1393516 <https://reviews.llvm.org/D57805#1393516>, @tejohnson wrote:

> In D57805#1393416 <https://reviews.llvm.org/D57805#1393416>, @vsk wrote:
>
> > @tejohnson have you had a chance to evaluate performance with IR-PGO + splitting enabled?
>
>
> I have one data point, more below.
>
> > Our internal CI shows performance regressions on SPEC/CINT2000 with FE-PGO + splitting enabled. Allowing inlining of split functions reduces the perf regression,
>
> This is controlled by the MinSize attribute, right?

Partially, yes, I think MinSize affects inlining thresholds. It's also controlled by the 'noinline' attribute. To tweak this, you can disable CI->setIsNoInline() in extractColdFunction.

>> and moving splitting after inlining eliminates it.
>> 
>> It seems important to inline and optimize certain basic blocks which FE PGO data marks cold. We may need to address this by changing the FE-PGO instrumentation, or by ignoring ProfileSummaryInfo when it's based on a FE profile.
>> 
>> What's interesting about this is that I didn't see a perf regression on SPEC/CINT2000 with IR-PGO + splitting. I'd be curious to know if your testing bears this out.
> 
> I tried for one important internal app that we build with IR-PGO and ThinLTO (late last week with this patch and r353434). Unfortunately it is degrading around 1%. I verified that if I remove the part that marks existing cold functions (i.e those that don't get split) with the MinSize attribute that this reduces the degradation to around 0.5%. If I prevent marking the new split cold functions with MinSize it possibly gets a bit better (degradation only around 0.4%).

I see, it sounds like the perf regression with PGO may not be specific to FE-PGO.

> When I had tried splitting awhile back when it was in the original position after inlining (in the ThinLTO backends) I got around neutral performance. I was hoping for some improvement based on experiments we had done back with gcc's function splitting (-freorder-blocks-and-partition).
> 
> I did do some profiling to compare function profiles with and without splitting enabled. I see only one case where we are spending any time in a cold split function (i.e. where the profile presumably wasn't accurate), but I don't think this is causing most of the difference. It looks like there are very different inlines (expected), but these might be causing a degradation for some reason. I will try moving the splitting to after the ThinLTO backend (post-thinlink) inlining and see what effect there is. Theoretically we should be getting more accurate importing/inlining, it would be good to understand where this is going wrong if not!

Thanks, this would be a really useful experiment.

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57805/new/

https://reviews.llvm.org/D57805