[PATCH] D57805: [HotColdSplit] Move splitting after instrumented PGO use

Mon Feb 11 12:17:02 PST 2019

tejohnson added a comment.

In D57805#1393416 <https://reviews.llvm.org/D57805#1393416>, @vsk wrote:

> @tejohnson have you had a chance to evaluate performance with IR-PGO + splitting enabled?

I have one data point, more below.

> Our internal CI shows performance regressions on SPEC/CINT2000 with FE-PGO + splitting enabled. Allowing inlining of split functions reduces the perf regression,

This is controlled by the MinSize attribute, right?

> and moving splitting after inlining eliminates it.
> 
> It seems important to inline and optimize certain basic blocks which FE PGO data marks cold. We may need to address this by changing the FE-PGO instrumentation, or by ignoring ProfileSummaryInfo when it's based on a FE profile.
> 
> What's interesting about this is that I didn't see a perf regression on SPEC/CINT2000 with IR-PGO + splitting. I'd be curious to know if your testing bears this out.

I tried for one important internal app that we build with IR-PGO and ThinLTO (late last week with this patch and r353434). Unfortunately it is degrading around 1%. I verified that if I remove the part that marks existing cold functions (i.e those that don't get split) with the MinSize attribute that this reduces the degradation to around 0.5%. If I prevent marking the new split cold functions with MinSize it possibly gets a bit better (degradation only around 0.4%). When I had tried splitting awhile back when it was in the original position after inlining (in the ThinLTO backends) I got around neutral performance. I was hoping for some improvement based on experiments we had done back with gcc's function splitting (-freorder-blocks-and-partition).

I did do some profiling to compare function profiles with and without splitting enabled. I see only one case where we are spending any time in a cold split function (i.e. where the profile presumably wasn't accurate), but I don't think this is causing most of the difference. It looks like there are very different inlines (expected), but these might be causing a degradation for some reason. I will try moving the splitting to after the ThinLTO backend (post-thinlink) inlining and see what effect there is. Theoretically we should be getting more accurate importing/inlining, it would be good to understand where this is going wrong if not!

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57805/new/

https://reviews.llvm.org/D57805