[llvm] r372232 - [SampleFDO] Minimize performance impact when profile-sample-accurate

Hans Wennborg via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 25 05:24:18 PDT 2019


 On Tue, Sep 24, 2019 at 4:37 PM Wei Mi <wmi at google.com> wrote:
>
> I committed another change at rL372665 trying to remove most of the size increase from this change. Please see if it fixes the size regression.

I tried it on the reproducer at
https://bugs.chromium.org/p/chromium/issues/detail?id=1006159#c1
(attachments at #c2), but it doesn't seem to have any effect: the
binary size is identical before and after r372665.

> On Tue, Sep 24, 2019 at 1:16 AM Hans Wennborg <hans at chromium.org> wrote:
>>
>> On Mon, Sep 23, 2019 at 5:58 PM Wei Mi <wmi at google.com> wrote:
>> >
>> >
>> >
>> > On Mon, Sep 23, 2019 at 5:49 AM Hans Wennborg <hans at chromium.org> wrote:
>> >>
>> >> For us, this appears to be a size regression here with no perf gain,
>> >> and the Chrome/Android binary size team tracks these things carefully.
>> >> How do we motivate this change to them? A 70 KB regression might be
>> >> okay if it buys us something else, but if there is no benefit, why
>> >> should it be accepted?
>> >>
>> >>
>> >> It sounds like the change is targeting internal Google benchmarks.
>> >> Perhaps this should be behind a flag which they can use? Or if you
>> >> want to change the default, shouldn't the case be made upstream with
>> >> data?
>> >>
>> >
>> > I think it is reasonable to put in a guard and not enable it for -Os and -Oz, but it sounds like the size regression is in pieces built with -O2/-O3. My understanding is that at -O2 and -O3, code size is usually less of a concern than performance. If code size is a big concern, maybe those pieces should use -Os/-Oz instead of -O2/-O3, because at -O2/-O3 I think it is common for people to trade size for better performance; inlining is a typical case.
>>
>> We build different parts of Chrome with different flags. For this
>> code, we use -O2/-O3 because we do value performance over size, but in
>> this case it's not clear that there is a performance benefit.
>
>
> Ok.
>
>>
>>
>> > About performance data, it fixed a 1% performance regression in an important internal benchmark, and we thought the problem was a general one and could potentially exist in other places as well.
>>
>> Is there any data showing that this is beneficial beyond the 1% on
>> this particular benchmark?
>
>
>  No, we haven't done a wide evaluation of it.
>
>>
>>
>>
>>
>> >> Reading the change description, it sounds like it's trying to work
>> >> around bad profile data:
>> >>
>> >> "when ProfileSampleAccurate is true, a function without samples will
>> >> be regarded as cold and this could potentially cause a performance
>> >> regression."
>> >>
>> >> What if the function has no sample because it really is cold though?
>> >> Inlining cold functions sounds like something we don't want. Would it
>> >> be possible to improve the accuracy of your sample data instead?
>> >
>> >
>> > That is not about the accuracy of the sample data. It is caused by a limitation in AutoFDO itself -- the profile is collected from an optimized binary but annotated, via debug information, onto IR that is not yet fully optimized. There is a gap between the two, and debug information in an optimized binary cannot be perfect. Because of this, when performance is a major concern, it is a little aggressive to treat everything without a profile as cold. The patch corrects that a little while still keeping most of the memory/code-size savings from profile-sample-accurate.
>> >
>> >>
>> >>
>> >> On Fri, Sep 20, 2019 at 6:13 PM Wei Mi <wmi at google.com> wrote:
>> >> >
>> >> > The size increase is expected. We are planning to use profile-sample-accurate more widely inside Google, but simply enabling it causes a performance regression, so we use this patch to minimize that. The size increase is mostly from more inlining, but in our experiments it is small compared with the code-size reduction from enabling profile-sample-accurate.
>> >> >
>> >> > If code size is more of a concern than performance for Chrome on Android, I can guard the change so it is not enabled for -Os and -Oz (I guess Chrome on Android is already built with -Os or -Oz).
>> >> >
>> >> > Please let me know what you think.
>> >> >
>> >> > Thanks,
>> >> > Wei.
>> >> >
>> >> > On Fri, Sep 20, 2019 at 4:14 AM Hans Wennborg <hans at chromium.org> wrote:
>> >> >>
>> >> >> This caused a ~70 KB size regression for Chrome on Android. Is this expected?
>> >> >>
>> >> >> Also +gbiv for Chrome Android pgo.
>> >> >>
>> >> >> We're tracking this in
>> >> >> https://bugs.chromium.org/p/chromium/issues/detail?id=1006159
>> >> >>
>> >> >> On Wed, Sep 18, 2019 at 6:05 PM Wei Mi via llvm-commits
>> >> >> <llvm-commits at lists.llvm.org> wrote:
>> >> >> >
>> >> >> > Author: wmi
>> >> >> > Date: Wed Sep 18 09:06:28 2019
>> >> >> > New Revision: 372232
>> >> >> >
>> >> >> > URL: http://llvm.org/viewvc/llvm-project?rev=372232&view=rev
>> >> >> > Log:
>> >> >> > [SampleFDO] Minimize performance impact when profile-sample-accurate
>> >> >> > is enabled.
>> >> >> >
>> >> >> > We can save memory and reduce binary size significantly by enabling
>> >> >> > ProfileSampleAccurate. However, when ProfileSampleAccurate is true, a
>> >> >> > function without samples will be regarded as cold and this could
>> >> >> > potentially cause a performance regression.
>> >> >> >
>> >> >> > To minimize the potential negative performance impact, we want to be
>> >> >> > a little conservative here: if a function shows up in the profile at
>> >> >> > all, whether as an outline instance, an inline instance, or a call
>> >> >> > target, treat the function as not cold. This handles cases where most
>> >> >> > callsites of a function are inlined in the sampled binary (so the
>> >> >> > outline copy gets no samples) but not inlined in the current build
>> >> >> > (because of source code drift, imprecise debug information, or
>> >> >> > callsites that are individually cold but not cold in aggregate...),
>> >> >> > so a function that shows up as cold in the sampled binary may
>> >> >> > actually not be cold in the current build. After the change, such a
>> >> >> > function will be treated as not cold even when
>> >> >> > profile-sample-accurate is enabled.
>> >> >> >
>> >> >> > At the same time, we lower the hot criteria of the callsiteIsHot
>> >> >> > check when profile-sample-accurate is enabled. callsiteIsHot is used
>> >> >> > to determine whether a callsite is hot and qualifies for early
>> >> >> > inlining. When profile-sample-accurate is enabled, functions without
>> >> >> > a profile are regarded as cold and much less inlining happens in the
>> >> >> > CGSCC inlining pass, so we can worry less about size increase and be
>> >> >> > more aggressive about allowing early inlining for warm callsites,
>> >> >> > which is helpful for performance overall.
>> >> >> >
>> >> >> > Differential Revision: https://reviews.llvm.org/D67561


More information about the llvm-commits mailing list