[llvm-dev] RFC: PGO Late instrumentation for LLVM
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Tue Sep 1 11:57:36 PDT 2015
On Tue, Sep 1, 2015 at 11:47 AM, Sean Silva via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
>
> On Tue, Sep 1, 2015 at 11:03 AM, Rong Xu <xur at google.com> wrote:
>
>> Justin, Sean and other people interested in this proposal,
>>
>> I'm wondering if you have had a chance to read the new experiment results in
>> my last email, sent 2 weeks ago. Can you share your thoughts, or are there
>> other tests that you want me to run?
>>
>
> See my email from Aug 11 (3 weeks ago). Adding an IR-level instrumentation
> pass makes sense (you didn't need to provide any performance data to
> support this; there are plenty of good reasons), but there are a couple
> independent parts. Have you been able to work on splitting out any of them?
>
>
>>
>> I'm in the final stage of preparing the patch. If you are OK, I can send
>> out the patch soon.
>>
>
> I'm not sure what you mean by "the" patch. It seems pretty clear that
> there are multiple sub-parts to this. Could you send an RFC for part 1 that
> I described? We especially need to discuss the interface for frontends e.g.
> clang command line interface, when a user passes a profile file how do we
> thread that information back to the middle-end, details for the runtime
> interoperation (things like function hash will have different meaning
> between IR-level and Clang instrumentation), etc.
>
>
Those are good suggestions!
thanks,
David
> -- Sean Silva
>
>
>>
>> Thanks,
>>
>> -Rong
>>
>> On Wed, Aug 19, 2015 at 5:18 PM, Philip Reames <listmail at philipreames.com
>> > wrote:
>>
>>> Thank you for sharing the data. I haven't been following the
>>> discussion, but this data made for very interesting reading on its own.
>>>
>>> Philip
>>>
>>>
>>> On 08/19/2015 03:39 PM, Rong Xu via llvm-dev wrote:
>>>
>>> We collected more data to address some of the questions from the
>>> reviewers. Note that this time we use clang itself as the benchmark. We chose
>>> clang because we think it's a typical C++ program and the reviewers here
>>> have good knowledge of the code base.
>>>
>>> What we measure is running time for clang to compile a large
>>> preprocessed source file (4.98M lines of .ii file), using different
>>> compilation modes. All the numbers reported here are the average running
>>> time of 5 runs in seconds.
>>>
>>> *(1) Performance b/w late instrumentation v.s. not instrumenting single
>>> BB functions*
>>>
>>> We first compare various instrumentation performance.
>>>
>>> ----------------------------------------------------------------------------
>>> Config                     wall_time_for_instr  ratio_vs_base  profile_size
>>> (1) base O2                             80.386         100.0%            --
>>> (2) FE-based Instr                     201.658         250.8%      65238880
>>> (3) late Instr                         103.662         129.0%      14860144
>>> (4) (3) + w/o pre-inline               199.924         248.7%      70762720
>>> (5) (4) + Silva                        119.904         149.2%      24499528
>>>
>>> Config(5) used the simple heuristic that Sean Silva proposed: not
>>> instrumenting single BB functions that contain less than 10 instructions
>>> (excluding debug and phi stmts).
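The heuristic described for config (5) can be pictured as a simple per-function predicate. The following is only a minimal sketch under stated assumptions: the `Instr` and `Function` classes, the `is_debug` flag, and the name `should_instrument` are hypothetical stand-ins for illustration and are not taken from the patch or from any LLVM API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, simplified IR model; real LLVM IR is far richer.
@dataclass
class Instr:
    opcode: str
    is_debug: bool = False  # assumed marker for debug-info instructions

@dataclass
class Function:
    name: str
    blocks: List[List[Instr]]  # each basic block is a list of instructions

SMALL_FUNCTION_THRESHOLD = 10  # "less than 10 instructions" from the email

def should_instrument(fn: Function, threshold: int = SMALL_FUNCTION_THRESHOLD) -> bool:
    """Return False only for single-basic-block functions that are small."""
    if len(fn.blocks) != 1:
        return True  # the heuristic only skips single-BB functions
    # Debug and phi instructions are excluded from the count.
    counted = [i for i in fn.blocks[0] if not i.is_debug and i.opcode != "phi"]
    return len(counted) >= threshold
```

Under this sketch a two-instruction getter would be skipped, while any function with more than one basic block is always instrumented.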
>>>
>>> We can see:
>>> 1) Simple heuristic of not instrumenting small single BB functions
>>> improves instrumentation performance as expected.
>>> 2) Using the simple heuristic is still slower than late instrumentation with
>>> pre-inlining: the latter is 15% faster.
>>> 3) Late instrumentation produces the smallest profile size: it's 39%
>>> smaller than using the simple heuristic.
>>>
>>> The result is expected as pre-inlining can handle more cases than the
>>> simple heuristic. There is a significant performance gap between the simple
>>> heuristic (5) and late instrumentation (3).
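For readers who want to re-derive the percentages from the raw numbers, here is a small worked calculation. It uses only the wall times and profile sizes reported in the table; the variable names are illustrative, and the results are recomputed from the rounded table entries, so the last digit can differ slightly from the figures in the email.

```python
# Wall time (seconds) and profile size (bytes) as reported in the table.
base_time = 80.386
configs = {
    "FE-based instr": (201.658, 65238880),
    "late instr": (103.662, 14860144),
    "late instr w/o pre-inline": (199.924, 70762720),
    "w/o pre-inline + heuristic": (119.904, 24499528),
}

# Instrumented run time relative to the uninstrumented O2 build.
ratio_vs_base = {name: t / base_time for name, (t, _) in configs.items()}

late_time, late_size = configs["late instr"]
heur_time, heur_size = configs["w/o pre-inline + heuristic"]

# How much faster late instrumentation runs than the heuristic build.
speed_ratio = heur_time / late_time
# How much smaller the late-instrumentation profile is.
size_reduction = 1 - late_size / heur_size

for name, r in ratio_vs_base.items():
    print(f"{name:28s} {r * 100:6.1f}% of base")
print(f"late vs heuristic speed ratio: {speed_ratio:.3f}")
print(f"profile size reduction: {size_reduction * 100:.1f}%")
```

The speed ratio works out to a little over 1.15 and the size reduction to about 39%, in line with observations 2) and 3).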
>>>
>>> We also used a few larger internal benchmarks to further validate the
>>> above result. The following table shows the slowdown compared to the base
>>> O2. The labels (2) to (5) refer to the same config as in the previous table.
>>> ------------------------------------------------------
>>> Program            (2)       (3)       (4)       (5)
>>> C++benchmark16  -45.24%   -12.93%   -43.84%   -24.74%
>>> C++benchmark17  -90.86%   -58.19%   -87.77%   -80.62%
>>> C++benchmark18  -95.32%   -54.75%   -91.21%   -82.56%
>>>
>>>
>>> We can see the same trend as in the clang benchmark: the simple heuristic
>>> (5) recovers much of the performance lost with FE-based
>>> instrumentation, but is still significantly worse than late instrumentation
>>> (3).
>>>
>>> *(2) Performance impact of context sensitivity*
>>>
>>> LLVM does not use the profile information fully in the back-end
>>> optimizations, for instance, inlining does not fully use the profile counts
>>> -- it only marks hot/cold function attribute based on function entry
>>> counts. To evaluate the impact of profile context sensitivity, GCC is used
>>> in the experiment. Note that GCC PGO improves clang performance a lot more
>>> than clang PGO.
>>>
>>> First we summarize the methodology used in the experiment:
>>> 0) build clang with GCC O2 without early inlining and measure clang's
>>> performance. GCC early inlining (einline) is similar to pre-inline used by
>>> late instrumentation.
>>> 1) build clang with GCC O2 with early inlining and measure performance.
>>>
>>> The performance difference of 1) and 0) is denoted as E which measures
>>> the contribution of early inlining.
>>>
>>> 2) build clang with GCC O2 + PGO without early inlining.
>>> 3) build clang with GCC O2 + PGO with early inlining.
>>>
>>> The performance difference of 3) and 2) is denoted as EC. It consists of
>>> roughly two parts: a) the early inlining contribution, and b) the context
>>> sensitive profiling enabled by early inlining.
>>>
>>> The contribution of context sensitive profiling can be estimated by EC -
>>> E above.
>>>
>>> -------------------------------------------------------------------------------
>>> Config                        wall_time_for_use  speedup_vs_(0)  speedup_vs_(1)
>>> (0) base w/o einline                     84.946           1.000           0.934
>>> (1) base O2                              79.310           1.071           1.000
>>> (2) profile-arcs w/o einline             63.518           1.337           1.249
>>> (3) profile-arcs                         48.364           1.756           1.640
>>>
>>> We see the following:
>>> 1) GCC PGO with early inlining improves clang performance by 64.0% (v.s.
>>> base O2 w/ early inline).
>>> 2) GCC PGO w/o early inlining improves clang performance by 33.7% (v.s.
>>> base O2 w/o early inline).
>>> 3) Early inlining performance contribution is about 7.1%.
>>> 4) Profile context sensitivity contribution is estimated to be 23.2%
>>> (i.e. 64.0% - 33.7% - 7.1%), which is pretty significant.
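The EC - E estimate can be recomputed directly from the four wall times in the table. This is only a worked sketch of the subtraction described in the methodology; the variable names are illustrative, and treating the contributions as additive is the email's simplification rather than an exact decomposition.

```python
# Wall times (seconds) for clang built with GCC, from the table above.
t_base_no_einline = 84.946  # config (0)
t_base = 79.310             # config (1)
t_pgo_no_einline = 63.518   # config (2)
t_pgo = 48.364              # config (3)

# Speedups expressed as fractional improvements.
pgo_gain_with_einline = t_base / t_pgo - 1                           # (3) vs (1)
pgo_gain_without_einline = t_base_no_einline / t_pgo_no_einline - 1  # (2) vs (0)
einline_gain = t_base_no_einline / t_base - 1                        # (1) vs (0)

# Estimated context-sensitivity contribution: what remains of the PGO gain
# with early inlining after removing the PGO gain without it and the plain
# early-inlining gain.
context_gain = pgo_gain_with_einline - pgo_gain_without_einline - einline_gain

print(f"PGO gain with einline:    {pgo_gain_with_einline * 100:.1f}%")
print(f"PGO gain without einline: {pgo_gain_without_einline * 100:.1f}%")
print(f"einline gain:             {einline_gain * 100:.1f}%")
print(f"estimated context gain:   {context_gain * 100:.1f}%")
```

From the unrounded times the remainder comes out at roughly 23 percentage points.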
>>>
>>> *(3) Pre-inline pass impact on the value profiling*
>>>
>>> Again, we use GCC as the platform to estimate:
>>>
>>> --------------------------------------------------------
>>> Config                            wall_time_for_instr
>>> (2) profile-arcs                              115.720
>>> (3) profile-arcs w/o einline                  310.560
>>> (4) profile-generate                          139.952
>>> (5) profile-generate w/o einline              680.910
>>>
>>> In GCC, -fprofile-generate does -fprofile-arcs as well as the value
>>> profiling. The above table shows that with value profile, the impact of
>>> pre-inlining is even larger for instrumented binary performance. Without
>>> value profiling, disabling pre-inlining increases runtime by 1.7x, while
>>> with value profiling, its impact is 3.9x increase in runtime.
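The 1.7x and 3.9x figures are increases relative to the run with early inlining, which a short calculation on the table values makes explicit. The variable names below are illustrative only.

```python
# Instrumented-binary wall times (seconds) from the GCC table above.
arcs = 115.720
arcs_no_einline = 310.560
generate = 139.952
generate_no_einline = 680.910

# Extra run time caused by disabling early inlining, as a multiple of the
# run time with early inlining (so 1.0 would mean "twice as slow").
arcs_increase = arcs_no_einline / arcs - 1
generate_increase = generate_no_einline / generate - 1

print(f"profile-arcs:     +{arcs_increase:.1f}x")
print(f"profile-generate: +{generate_increase:.1f}x")
```

In absolute terms the instrumented binary takes about 2.7 times as long without early inlining for edge profiling alone, and about 4.9 times as long once value profiling is added.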
>>>
>>>
>>> On Tue, Aug 11, 2015 at 10:11 PM, Sean Silva via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Aug 11, 2015 at 11:07 AM, Diego Novillo via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> One aspect of this that I have not seen discussed is that middle-end
>>>>> instrumentation enables PGO optimizations to front-ends other than Clang.
>>>>>
>>>>> While I agree that FE instrumentation could be improved, it still
>>>>> requires every FE to implement essentially the same common functionality.
>>>>> Having PGO instrumentation generated in the middle-end allows every FE
>>>>> to automatically take advantage of PGO.
>>>>>
>>>>
>>>> This is a really good point, and I agree with it. We may have gotten
>>>> off on the wrong foot since Rong's email focused so heavily on comparing
>>>> with the frontend instrumentation. As far as I see it, Rong's proposal has
>>>> a couple different parts:
>>>>
>>>> 1. Infrastructure for IR-level instrumentation-based PGO
>>>> 2. Changes to the pass pipeline so that a hypothetical IR-level
>>>> instrumentation-based PGO is more effective
>>>> 3. MST algorithm with profile feedback for optimal placement of counter
>>>> updates.
>>>>
>>>> I think 1. is a no-brainer, if only so that all LLVM clients can
>>>> benefit from PGO, and also (as you pointed out below) so that it can have
>>>> an exclusive focus on performance. If it is sufficiently flexible, it may
>>>> even make sense to restrict clang's frontend instrumentation-based
>>>> profiling to non-performance stuff, and have clang directly interoperate
>>>> with the IR-level PGO for performance-related PGO use cases, just like any
>>>> other frontend would.
>>>>
>>>> Philip and Sanjoy, out of curiosity do you guys use your own
>>>> instrumentation placement for PGO? Is an IR-level PGO infrastructure
>>>> upstream something you guys would be interested in?
>>>>
>>>> I think that 2. is something that once we have 1. we will be able to
>>>> evaluate better, but for now my opinion is that we should be able to make
>>>> good progress without digging into that.
>>>>
>>>> I think that 3. is a no-brainer if it provides a really significant
>>>> win, but without 1. we can't really measure its effect in isolation. It
>>>> also has a usability problem since it requires feeding in an existing
>>>> profile for the *instrumented* build, but if the benefit is very
>>>> significant this may be worth it for some users. We will probably be able
>>>> to easily refactor 1. as needed into an MST approach that degrades
>>>> gracefully to using static heuristics in the absence of real profile
>>>> information, so is not a maintenance burden (maybe even helps by providing
>>>> a good framework in which to develop effective static heuristics).
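To make item 3 more concrete, here is a minimal sketch of the classic spanning-tree approach to counter placement: build a maximum-weight spanning tree over the CFG edges (treated as undirected, with a virtual exit-to-entry edge) and place counters only on the edges left out of the tree, since the remaining counts can be recovered by flow conservation. This is an assumption about the general technique, not a description of Rong's patch. The function names, the edge representation, and the example weights are all hypothetical, and the weights could come either from a prior profile or from static heuristics.

```python
from typing import Dict, List, Tuple

# (source block, destination block, estimated execution weight)
Edge = Tuple[str, str, float]

def find(parent: Dict[str, str], x: str) -> str:
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def choose_instrumented_edges(edges: List[Edge]) -> List[Edge]:
    """Kruskal-style maximum spanning tree; return the non-tree edges.

    Hot edges are considered first so they tend to land in the tree and
    therefore need no counter; only the leftover edges get instrumented.
    """
    parent: Dict[str, str] = {}
    for src, dst, _ in edges:
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)

    instrumented: List[Edge] = []
    for src, dst, weight in sorted(edges, key=lambda e: e[2], reverse=True):
        root_src, root_dst = find(parent, src), find(parent, dst)
        if root_src == root_dst:
            # Adding this edge would close a cycle: it needs a counter.
            instrumented.append((src, dst, weight))
        else:
            parent[root_src] = root_dst  # edge joins the spanning tree
    return instrumented

# Diamond-shaped CFG with a virtual exit->entry edge; weights are made up.
diamond = [
    ("entry", "A", 90.0),
    ("entry", "B", 10.0),
    ("A", "exit", 90.0),
    ("B", "exit", 10.0),
    ("exit", "entry", 100.0),  # virtual edge closing the flow
]
counters = choose_instrumented_edges(diamond)
print(counters)
```

With four blocks and five edges, three edges form the tree and only two need counters. If no profile is available, the same routine runs unchanged on statically estimated weights, which is the graceful degradation mentioned above.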
>>>>
>>>> For the time being, I think we can avoid discussion of 2. and 3. until
>>>> we have more of 1. working. So I think it would be most productive if we
>>>> focus this discussion on 1.
>>>>
>>>>
>>>>> Additionally, some of the overhead imposed by FE instrumentation is
>>>>> not really all that easy to get rid of. You end up duplicating
>>>>> functionality that is more naturally implemented in the middle end.
>>>>>
>>>>
>>>> Yeah, I was looking into a couple of other simple approaches and
>>>> quickly found out that I was basically replicating much of the sort of
>>>> logic that the inliner already has.
>>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>>
>>>>> I see the two approaches as supplementary, rather than complementary.
>>>>> One does not negate the other. Some of the optimizations we'd do in the
>>>>> FE, may hurt coverage. Instead, by instrumenting in the middle end, you
>>>>> can focus exclusively on performance (coverage be damned).
>>>>>
>>>>>
>>>>> Diego.
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org http://llvm.cs.uiuc.edu
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
>