[llvm] r372232 - [SampleFDO] Minimize performance impact when profile-sample-accurate

Wei Mi via llvm-commits llvm-commits at lists.llvm.org
Mon Sep 23 08:57:56 PDT 2019


On Mon, Sep 23, 2019 at 5:49 AM Hans Wennborg <hans at chromium.org> wrote:

> For us, this appears to be a size regression with no perf gain,
> and the Chrome/Android binary size team tracks these things carefully.
> How do we motivate this change to them? A 70 KB regression might be
> okay if it buys us something else, but if there is no benefit, why
> should it be accepted?


> It sounds like the change is targeting internal Google benchmarks.
> Perhaps this should be behind a flag which they can use? Or if you
> want to change the default, shouldn't the case be made upstream with
> data?
>
>
I think it is reasonable to add a guard and not enable it for Os and Oz,
but it sounds like the size regression is in pieces built with O2/O3. From
my understanding, for O2 and O3 code size is usually less of a concern than
performance. If code size is a big concern, maybe Os/Oz should be used
instead of O2/O3, because in O2/O3 it is common to trade size for better
performance; inlining is a typical case.

About performance data: the change fixed a 1% performance regression in an
important internal benchmark, and we thought the problem was a general one
that could potentially exist in other places as well.


> Reading the change description, it sounds like it's trying to work
> around bad profile data:
>
> "when ProfileSampleAccurate is true, function without sample will be
> regarded as cold and this could potentially cause performance
> regression."
>
> What if the function has no sample because it really is cold though?
> Inlining cold functions sounds like something we don't want. Would it
> be possible to improve the accuracy of your sample data instead?
>

That is not about the accuracy of the sample data. It is caused by a
limitation in AutoFDO itself -- the profile is collected from an optimized
binary and then annotated, via debug information, onto IR that is not yet
fully optimized. There is a gap between the two, and debug information in
an optimized binary cannot be perfect. Because of that, if performance is a
major concern, it is a little aggressive to treat everything without a
profile as cold. The patch corrects this a bit while still keeping most of
the memory/code size savings from profile-sample-accurate.


>
> On Fri, Sep 20, 2019 at 6:13 PM Wei Mi <wmi at google.com> wrote:
> >
> > The size increase is expected. We are planning to use
> > profile-sample-accurate more widely at Google, but simply enabling it
> > would cause performance regressions, so we use this patch to minimize
> > them. The size increase is mostly from more inlining, but in our
> > experiments it is small compared with the code size reduction from
> > enabling profile-sample-accurate.
> >
> > If code size is more of a concern than performance for Chrome on
> > Android, I can guard the change so it is not enabled for -Os and -Oz
> > (I guess Chrome on Android is already built with -Os or -Oz).
> >
> > Please let me know what you think.
> >
> > Thanks,
> > Wei.
> >
> > On Fri, Sep 20, 2019 at 4:14 AM Hans Wennborg <hans at chromium.org> wrote:
> >>
> >> This caused a ~70 KB size regression for Chrome on Android. Is this
> >> expected?
> >>
> >> Also +gbiv for Chrome Android pgo.
> >>
> >> We're tracking this in
> >> https://bugs.chromium.org/p/chromium/issues/detail?id=1006159
> >>
> >> On Wed, Sep 18, 2019 at 6:05 PM Wei Mi via llvm-commits
> >> <llvm-commits at lists.llvm.org> wrote:
> >> >
> >> > Author: wmi
> >> > Date: Wed Sep 18 09:06:28 2019
> >> > New Revision: 372232
> >> >
> >> > URL: http://llvm.org/viewvc/llvm-project?rev=372232&view=rev
> >> > Log:
> >> > [SampleFDO] Minimize performance impact when profile-sample-accurate
> >> > is enabled.
> >> >
> >> > We can save memory and reduce binary size significantly by enabling
> >> > ProfileSampleAccurate. However, when ProfileSampleAccurate is true,
> >> > functions without samples will be regarded as cold, and this could
> >> > potentially cause performance regressions.
> >> >
> >> > To minimize the potential negative performance impact, we want to be
> >> > a little conservative here: if a function shows up in the profile at
> >> > all, whether as an outline instance, an inline instance, or a call
> >> > target, treat the function as not cold. This handles cases where
> >> > most callsites of a function were inlined in the sampled binary (so
> >> > the outline copy didn't get any samples) but are not inlined in the
> >> > current build (because of source code drift, imprecise debug
> >> > information, or because the callsites are all cold individually but
> >> > not cold cumulatively...), so a function that shows up as cold in
> >> > the sampled binary may actually not be cold in the current build.
> >> > After the change, such a function will be treated as not cold even
> >> > when profile-sample-accurate is enabled.
> >> >
> >> > At the same time, we lower the hotness criterion of the
> >> > callsiteIsHot check when profile-sample-accurate is enabled.
> >> > callsiteIsHot is used to determine whether a callsite is hot and
> >> > qualifies for early inlining. When profile-sample-accurate is
> >> > enabled, functions without profiles are regarded as cold and much
> >> > less inlining happens in the CGSCC inlining pass, so we can worry
> >> > less about size increases and be aggressive in allowing more early
> >> > inlining for warm callsites, which helps performance overall.
> >> >
> >> > Differential Revision: https://reviews.llvm.org/D67561
>

