[llvm] r372232 - [SampleFDO] Minimize performance impact when profile-sample-accurate

Wei Mi via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 25 12:08:03 PDT 2019


Hans, thanks for checking. I will try another way of fixing it.

On Wed, Sep 25, 2019 at 5:24 AM Hans Wennborg <hans at chromium.org> wrote:

>  On Tue, Sep 24, 2019 at 4:37 PM Wei Mi <wmi at google.com> wrote:
> >
> > I committed another change trying to remove the most size increase from
> this change at rL372665. Please see if it can fix the size regression.
>
> I tried it on the reproducer at
> https://bugs.chromium.org/p/chromium/issues/detail?id=1006159#c1
> (attachments at #c2), but it doesn't seem to have any effect: the
> binary size is identical before and after r372665.
>
> > On Tue, Sep 24, 2019 at 1:16 AM Hans Wennborg <hans at chromium.org> wrote:
> >>
> >> On Mon, Sep 23, 2019 at 5:58 PM Wei Mi <wmi at google.com> wrote:
> >> >
> >> >
> >> >
> >> > On Mon, Sep 23, 2019 at 5:49 AM Hans Wennborg <hans at chromium.org>
> wrote:
> >> >>
> >> For us, this appears to be a size regression with no perf gain,
> >> >> and the Chrome/Android binary size team tracks these things
> carefully.
> >> >> How do we motivate this change to them? A 70 KB regression might be
> >> >> okay if it buys us something else, but if there is no benefit, why
> >> >> should it be accepted?
> >> >>
> >> >>
> >> >> It sounds like the change is targeting internal Google benchmarks.
> >> >> Perhaps this should be behind a flag which they can use? Or if you
> >> >> want to change the default, shouldn't the case be made upstream with
> >> >> data?
> >> >>
> >> >
> >> > I think it is reasonable to put a guard and not enable it for Os
> >> > and Oz, but it sounds like the size regression is in pieces built
> >> > with O2/O3. From my understanding, at O2 and O3 code size is usually
> >> > less of a concern than performance. If code size is a big concern,
> >> > maybe those pieces should be built with Os/Oz instead of O2/O3,
> >> > because I think it is common at O2/O3 to trade size for optimal
> >> > performance; inlining is a typical case.
> >>
> >> We build different parts of Chrome with different flags. For this
> >> code, we use -O2/-O3 because we do value performance over size, but in
> >> this case it's not clear that there is a performance benefit.
> >
> >
> > Ok.
> >
> >>
> >>
> >> > Regarding performance data: it fixed a 1% performance regression in
> >> > an important internal benchmark, and we thought the problem was a
> >> > general one that could potentially exist in other places as well.
> >>
> >> Is there any data showing that this is beneficial beyond the 1% on
> >> this particular benchmark?
> >
> >
> > No, we haven't done a wide evaluation of it.
> >
> >>
> >>
> >>
> >>
> >> >> Reading the change description, it sounds like it's trying to work
> >> >> around bad profile data:
> >> >>
> >> >> "when ProfileSampleAccurate is true, functions without samples will
> >> >> be regarded as cold and this could potentially cause performance
> >> >> regressions."
> >> >>
> >> >> What if the function has no sample because it really is cold though?
> >> >> Inlining cold functions sounds like something we don't want. Would it
> >> >> be possible to improve the accuracy of your sample data instead?
> >> >
> >> >
> >> > That is not about the accuracy of the sample data. It is caused by a
> >> > limitation in AutoFDO itself -- we use a profile collected from an
> >> > optimized binary and annotate it, via debug information, onto IR that
> >> > is not yet fully optimized. There is a gap between the two, and the
> >> > debug information in an optimized binary cannot be perfect. Because
> >> > of that, if performance is a major concern, it is a little aggressive
> >> > to treat everything without a profile as cold. The patch corrects
> >> > this a bit while still keeping most of the memory/code size savings
> >> > from profile-sample-accurate.
> >> >
> >> >>
> >> >>
> >> >> On Fri, Sep 20, 2019 at 6:13 PM Wei Mi <wmi at google.com> wrote:
> >> >> >
> >> >> > The size increase is expected. We are planning to use
> >> >> > profile-sample-accurate more widely in Google, but simply enabling
> >> >> > it would cause a performance regression, so we use this patch to
> >> >> > minimize that. The size increase is mostly from more inlining, but
> >> >> > in our experiments it is small compared with the code size
> >> >> > reduction from enabling profile-sample-accurate.
> >> >> >
> >> >> > If code size is more of a concern than performance for Chrome on
> >> >> > Android, I can guard the change so it is not enabled for -Os and
> >> >> > -Oz (I guess Chrome on Android is already built with -Os or -Oz).
> >> >> >
> >> >> > Please let me know what you think.
> >> >> >
> >> >> > Thanks,
> >> >> > Wei.
> >> >> >
> >> >> > On Fri, Sep 20, 2019 at 4:14 AM Hans Wennborg <hans at chromium.org>
> wrote:
> >> >> >>
> >> >> >> This caused a ~70 KB size regression for Chrome on Android. Is
> this expected?
> >> >> >>
> >> >> >> Also +gbiv for Chrome Android pgo.
> >> >> >>
> >> >> >> We're tracking this in
> >> >> >> https://bugs.chromium.org/p/chromium/issues/detail?id=1006159
> >> >> >>
> >> >> >> On Wed, Sep 18, 2019 at 6:05 PM Wei Mi via llvm-commits
> >> >> >> <llvm-commits at lists.llvm.org> wrote:
> >> >> >> >
> >> >> >> > Author: wmi
> >> >> >> > Date: Wed Sep 18 09:06:28 2019
> >> >> >> > New Revision: 372232
> >> >> >> >
> >> >> >> > URL: http://llvm.org/viewvc/llvm-project?rev=372232&view=rev
> >> >> >> > Log:
> >> >> >> > [SampleFDO] Minimize performance impact when
> profile-sample-accurate
> >> >> >> > is enabled.
> >> >> >> >
> >> >> >> > We can save memory and reduce binary size significantly by
> >> >> >> > enabling ProfileSampleAccurate. However, when
> >> >> >> > ProfileSampleAccurate is true, functions without samples will be
> >> >> >> > regarded as cold, and this could potentially cause performance
> >> >> >> > regressions.
> >> >> >> >
> >> >> >> > To minimize the potential negative performance impact, we want
> >> >> >> > to be a little conservative here: if a function shows up in the
> >> >> >> > profile at all -- as an outline instance, an inline instance, or
> >> >> >> > a call target -- treat it as not cold. This handles cases where
> >> >> >> > most callsites of a function were inlined in the sampled binary
> >> >> >> > (so the outline copy got no samples) but are not inlined in the
> >> >> >> > current build (because of source code drift, imprecise debug
> >> >> >> > information, or callsites that are individually cold but not
> >> >> >> > cold in aggregate...), so a function that appears cold in the
> >> >> >> > sampled binary may actually not be cold in the current build.
> >> >> >> > After this change, such functions are treated as not cold even
> >> >> >> > when profile-sample-accurate is enabled.
> >> >> >> >
> >> >> >> > At the same time, we lower the hot criteria of the callsiteIsHot
> >> >> >> > check when profile-sample-accurate is enabled. callsiteIsHot is
> >> >> >> > used to determine whether a callsite is hot and qualifies for
> >> >> >> > early inlining. When profile-sample-accurate is enabled,
> >> >> >> > functions without profiles are regarded as cold and much less
> >> >> >> > inlining happens in the CGSCC inlining pass, so we can worry
> >> >> >> > less about size increases and be more aggressive in allowing
> >> >> >> > early inlining for warm callsites, which helps overall
> >> >> >> > performance.
> >> >> >> >
> >> >> >> > Differential Revision: https://reviews.llvm.org/D67561
>