[llvm] r372232 - [SampleFDO] Minimize performance impact when profile-sample-accurate

Wei Mi via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 27 16:21:51 PDT 2019


Hi Hans,

I committed another fix at rL373133, and verified it removed the size
increase from escape-analysis.o (the reproducer). Please take a look.

Thanks,
Wei.

On Wed, Sep 25, 2019 at 12:08 PM Wei Mi <wmi at google.com> wrote:

> Hans, thanks for checking. I will try another way of fixing it.
>
> On Wed, Sep 25, 2019 at 5:24 AM Hans Wennborg <hans at chromium.org> wrote:
>
>>  On Tue, Sep 24, 2019 at 4:37 PM Wei Mi <wmi at google.com> wrote:
>> >
>> > I committed another change at rL372665 that tries to remove most of the
>> > size increase from this change. Please check whether it fixes the size
>> > regression.
>>
>> I tried it on the reproducer at
>> https://bugs.chromium.org/p/chromium/issues/detail?id=1006159#c1
>> (attachments at #c2), but it doesn't seem to have any effect: the
>> binary size is identical before and after r372665.
>>
>> > On Tue, Sep 24, 2019 at 1:16 AM Hans Wennborg <hans at chromium.org>
>> wrote:
>> >>
>> >> On Mon, Sep 23, 2019 at 5:58 PM Wei Mi <wmi at google.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Mon, Sep 23, 2019 at 5:49 AM Hans Wennborg <hans at chromium.org>
>> wrote:
>> >> >>
>> >> >> For us, this appears to be a size regression with no performance
>> >> >> gain, and the Chrome/Android binary-size team tracks these things
>> >> >> carefully. How do we motivate this change to them? A 70 KB
>> >> >> regression might be okay if it buys us something else, but if there
>> >> >> is no benefit, why should it be accepted?
>> >> >>
>> >> >>
>> >> >> It sounds like the change is targeting internal Google benchmarks.
>> >> >> Perhaps this should be behind a flag which they can use? Or if you
>> >> >> want to change the default, shouldn't the case be made upstream with
>> >> >> data?
>> >> >>
>> >> >
>> >> > I think it is reasonable to add a guard so the change is not enabled
>> >> > at -Os and -Oz, but it sounds like the size regression is in pieces
>> >> > built with -O2/-O3. My understanding is that at -O2 and -O3, code
>> >> > size is usually less of a concern than performance. If code size is a
>> >> > major concern, perhaps those pieces should be built with -Os/-Oz
>> >> > instead, because trading some size for optimal performance is common
>> >> > at -O2/-O3; inlining is a typical example.
>> >>
>> >> We build different parts of Chrome with different flags. For this
>> >> code, we use -O2/-O3 because we do value performance over size, but in
>> >> this case it's not clear that there is a performance benefit.
>> >
>> >
>> > Ok.
>> >
>> >>
>> >>
>> >> > About the performance data: it fixed a 1% performance regression in
>> >> > an important internal benchmark, and we thought the problem was a
>> >> > general one that could potentially exist elsewhere as well.
>> >>
>> >> Is there any data showing that this is beneficial beyond the 1% on
>> >> this particular benchmark?
>> >
>> >
>> >  No, we haven't done a wider evaluation of it.
>> >
>> >>
>> >>
>> >>
>> >>
>> >> >> Reading the change description, it sounds like it's trying to work
>> >> >> around bad profile data:
>> >> >>
>> >> >> "when ProfileSampleAccurate is true, function without sample will be
>> >> >> regarded as cold and this could potentially cause performance
>> >> >> regression."
>> >> >>
>> >> >> What if the function has no sample because it really is cold though?
>> >> >> Inlining cold functions sounds like something we don't want. Would it
>> >> >> be possible to improve the accuracy of your sample data instead?
>> >> >
>> >> >
>> >> > That is not about the accuracy of the sample data. It is caused by a
>> >> > limitation of AutoFDO itself -- the profile is collected from an
>> >> > optimized binary and then annotated, via debug information, onto IR
>> >> > that is not yet fully optimized. There is a gap between the two, and
>> >> > the debug information in an optimized binary cannot be perfect.
>> >> > Because of that, when performance is a major concern, it is a little
>> >> > aggressive to treat everything without a profile as cold. The patch
>> >> > corrects that a bit while still keeping most of the memory and code
>> >> > size savings from profile-sample-accurate.
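[For readers unfamiliar with the pipeline being discussed: a sketch of a
typical AutoFDO workflow, which is where the gap Wei describes comes from.
File names and inputs here are illustrative; `create_llvm_prof` is the
conversion tool from the AutoFDO project, and the clang flags are the
usual sample-PGO flags.]

```shell
# Profile an already-optimized binary with hardware sampling (Linux perf,
# branch records via -b).
perf record -b -- ./app_optimized benchmark_input

# Convert the raw samples into an LLVM sample profile, using the binary's
# debug info to map sampled addresses back to source lines.
create_llvm_prof --binary=./app_optimized --out=app.prof

# Rebuild with the profile. -fprofile-sample-accurate additionally treats
# functions absent from app.prof as cold, which is the behavior this
# patch relaxes.
clang -O2 -gline-tables-only -fprofile-sample-use=app.prof \
      -fprofile-sample-accurate -o app_fdo app.cc
```

The profile describes the optimized binary, but it is applied to IR that has
not yet been optimized; imperfect debug info in the optimized binary is what
makes the mapping lossy.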
>> >> >
>> >> >>
>> >> >>
>> >> >> On Fri, Sep 20, 2019 at 6:13 PM Wei Mi <wmi at google.com> wrote:
>> >> >> >
>> >> >> > The size increase is expected. We are planning to use
>> >> >> > profile-sample-accurate more widely at Google, but simply enabling
>> >> >> > it causes a performance regression, so we use this patch to
>> >> >> > minimize that. The size increase mostly comes from additional
>> >> >> > inlining, but in our experiments it is small compared with the
>> >> >> > code size reduction from enabling profile-sample-accurate.
>> >> >> >
>> >> >> > If code size is more of a concern than performance for Chrome on
>> >> >> > Android, I can guard the change so it is not enabled at -Os and
>> >> >> > -Oz (I guess Chrome on Android is already built with -Os or -Oz).
>> >> >> >
>> >> >> > Please let me know what you think.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Wei.
>> >> >> >
>> >> >> > On Fri, Sep 20, 2019 at 4:14 AM Hans Wennborg <hans at chromium.org>
>> wrote:
>> >> >> >>
>> >> >> >> This caused a ~70 KB size regression for Chrome on Android. Is
>> >> >> >> this expected?
>> >> >> >>
>> >> >> >> Also +gbiv for Chrome Android pgo.
>> >> >> >>
>> >> >> >> We're tracking this in
>> >> >> >> https://bugs.chromium.org/p/chromium/issues/detail?id=1006159
>> >> >> >>
>> >> >> >> On Wed, Sep 18, 2019 at 6:05 PM Wei Mi via llvm-commits
>> >> >> >> <llvm-commits at lists.llvm.org> wrote:
>> >> >> >> >
>> >> >> >> > Author: wmi
>> >> >> >> > Date: Wed Sep 18 09:06:28 2019
>> >> >> >> > New Revision: 372232
>> >> >> >> >
>> >> >> >> > URL: http://llvm.org/viewvc/llvm-project?rev=372232&view=rev
>> >> >> >> > Log:
>> >> >> >> > [SampleFDO] Minimize performance impact when
>> >> >> >> > profile-sample-accurate is enabled.
>> >> >> >> >
>> >> >> >> > We can save memory and reduce binary size significantly by
>> >> >> >> > enabling ProfileSampleAccurate. However, when
>> >> >> >> > ProfileSampleAccurate is true, functions without samples are
>> >> >> >> > regarded as cold, and this could potentially cause performance
>> >> >> >> > regressions.
>> >> >> >> >
>> >> >> >> > To minimize the potential negative performance impact, we want
>> >> >> >> > to be a little conservative here: if a function shows up in the
>> >> >> >> > profile at all, whether as an outline instance, an inline
>> >> >> >> > instance, or a call target, treat it as not cold. This handles
>> >> >> >> > cases where most callsites of a function are inlined in the
>> >> >> >> > sampled binary (so the outline copy gets no samples) but not in
>> >> >> >> > the current build (because of source code drift, imprecise
>> >> >> >> > debug information, or callsites that are all cold individually
>> >> >> >> > but not cold accumulatively, ...), so that a function showing
>> >> >> >> > up as cold in the sampled binary may actually not be cold in
>> >> >> >> > the current build. After the change, such a function is treated
>> >> >> >> > as not cold even when profile-sample-accurate is enabled.
>> >> >> >> >
>> >> >> >> > At the same time, we lower the hotness criterion of the
>> >> >> >> > callsiteIsHot check when profile-sample-accurate is enabled.
>> >> >> >> > callsiteIsHot is used to determine whether a callsite is hot
>> >> >> >> > and qualifies for early inlining. When profile-sample-accurate
>> >> >> >> > is enabled, functions without a profile are regarded as cold
>> >> >> >> > and much less inlining happens in the CGSCC inlining pass, so
>> >> >> >> > we can worry less about size increases and be aggressive in
>> >> >> >> > allowing more early inlining for warm callsites, which is
>> >> >> >> > helpful for overall performance.
>> >> >> >> >
>> >> >> >> > Differential Revision: https://reviews.llvm.org/D67561
>>
>

