[llvm-dev] RFC: PGO Late instrumentation for LLVM

Tue Sep 1 11:47:26 PDT 2015

On Tue, Sep 1, 2015 at 11:03 AM, Rong Xu <xur at google.com> wrote:

> Justin, Sean and other people interested in this proposal,
>
> I'm wondering if you have had a chance to read the new experiment results in
> my last email, sent 2 weeks ago.  Can you share your thoughts, or do you have
> other tests that you want to run?
>

See my email from Aug 11 (3 weeks ago). Adding an IR-level instrumentation
pass makes sense (you didn't need to provide any performance data to
support this; there are plenty of good reasons), but there are a couple
independent parts. Have you been able to work on splitting out any of them?

>
> I'm in the final stage of preparing the patch. If you are OK, I can send
> out the patch soon.
>

I'm not sure what you mean by "the" patch. It seems pretty clear that there
are multiple sub-parts to this. Could you send an RFC for part 1 that I
described? We especially need to discuss the interface for frontends (e.g.
the clang command-line interface), how we thread the information back to the
middle-end when a user passes a profile file, the details of runtime
interoperation (things like the function hash will have a different meaning
between IR-level and Clang instrumentation), etc.

-- Sean Silva

>
> Thanks,
>
> -Rong
>
> On Wed, Aug 19, 2015 at 5:18 PM, Philip Reames <listmail at philipreames.com>
> wrote:
>
>> Thank you for sharing the data.  I haven't been following the discussion,
>> but this data made for very interesting reading on its own.
>>
>> Philip
>>
>>
>> On 08/19/2015 03:39 PM, Rong Xu via llvm-dev wrote:
>>
>> We collected more data to address some of the questions from the
>> reviewers. Note this time we use clang itself as the benchmark. We choose
>> clang because we think it's a typical C++ program and the reviewers here
>> have good knowledge of the code base.
>>
>> What we measure is running time for clang to compile a large preprocessed
>> source file (4.98M lines of .ii file), using different compilation modes.
>> All the numbers reported here are the average running time of 5 runs in
>> seconds.
>>
>> *(1) Performance b/w late instrumentation v.s. not instrumenting single
>> BB functions*
>>
>> We first compare the performance of the various instrumentation configs.
>>
>> ----------------------------------------------------------------------------
>>   Config                   wall_time_for_instr   ratio_vs_base   profile_size
>> (1) base O2                     80.386             100.0%           --
>> (2) FE-based Instr             201.658             250.8%         65238880
>> (3) late Instr                 103.662             129.0%         14860144
>> (4) (3) + w/o pre-inline       199.924             248.7%         70762720
>> (5) (4) + Silva                119.904             149.2%         24499528
>>
>> Config (5) used the simple heuristic that Sean Silva proposed: not
>> instrumenting single BB functions that contain fewer than 10 instructions
>> (excluding debug and phi stmts).
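
A minimal sketch of the size heuristic described above, written against a toy data model rather than the LLVM API. The `Instr` and `Function` classes, the opcode strings, and the names `SKIPPED_OPCODES` and `SIZE_THRESHOLD` are all assumptions made for illustration; only the rule itself (single basic block, fewer than 10 instructions, debug and phi not counted) comes from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instr:
    opcode: str  # hypothetical opcode name, e.g. "add", "phi", "dbg"


@dataclass
class Function:
    name: str
    # Each basic block is a list of instructions (toy model, not LLVM IR).
    blocks: List[List[Instr]] = field(default_factory=list)


SKIPPED_OPCODES = {"phi", "dbg"}  # not counted toward the size, per the text
SIZE_THRESHOLD = 10               # "fewer than 10 instructions"


def should_instrument(fn: Function) -> bool:
    """Return False only for small single-basic-block functions."""
    if len(fn.blocks) != 1:
        return True  # the heuristic only applies to single-BB functions
    counted = sum(1 for i in fn.blocks[0] if i.opcode not in SKIPPED_OPCODES)
    return counted >= SIZE_THRESHOLD


if __name__ == "__main__":
    getter = Function("getter", [[Instr("load"), Instr("dbg"), Instr("ret")]])
    print(getter.name, should_instrument(getter))
```

A small accessor like `getter` is skipped, while any function with more than one block is always instrumented in this sketch.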
>>
>> We can see:
>> 1) The simple heuristic of not instrumenting small single BB functions
>> improves instrumentation performance as expected.
>> 2) Using the simple heuristic is still slower than late instrumentation with
>> pre-inlining: the latter is 15% faster.
>> 3) Late instrumentation produces the smallest profile size: it's 39%
>> smaller than using the simple heuristic.
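
The percentages in the list above can be rechecked from the table with a few lines of arithmetic. This is only a sketch: the dictionary keys are shorthand labels chosen here, and the numbers are copied from the table.

```python
# (wall time in seconds, profile size) per config, copied from the table above.
results = {
    "base": (80.386, None),
    "fe_instr": (201.658, 65238880),
    "late_instr": (103.662, 14860144),
    "late_no_preinline": (199.924, 70762720),
    "no_preinline_heuristic": (119.904, 24499528),
}

base_time = results["base"][0]

# ratio_vs_base column: instrumented run time as a percentage of base O2.
ratio_vs_base = {
    name: 100.0 * time / base_time for name, (time, _size) in results.items()
}

late_time, late_size = results["late_instr"]
heur_time, heur_size = results["no_preinline_heuristic"]

# How much longer the heuristic build runs than late instrumentation.
heuristic_extra_time = heur_time / late_time - 1.0

# How much smaller the late-instrumentation profile is than the heuristic one.
profile_reduction = 1.0 - late_size / heur_size

if __name__ == "__main__":
    for name, ratio in ratio_vs_base.items():
        print(f"{name:24s} {ratio:6.1f}%")
    print(f"heuristic extra time:  {heuristic_extra_time:.1%}")
    print(f"profile size reduction: {profile_reduction:.1%}")
```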
>>
>> The result is expected as pre-inlining can handle more cases than the
>> simple heuristic. There is a significant performance gap between the simple
>> heuristic (5) and late instrumentation (3).
>>
>> We also used a few larger internal benchmarks to further validate the
>> above result. The following table shows the slowdown compared to the base
>> O2. The labels (2) to (5) refer to the same config as in the previous table.
>> ------------------------------------------------------
>> Program                (2)      (3)      (4)      (5)
>> C++benchmark16      -45.24%  -12.93%  -43.84%  -24.74%
>> C++benchmark17      -90.86%  -58.19%  -87.77%  -80.62%
>> C++benchmark18      -95.32%  -54.75%  -91.21%  -82.56%
>>
>>
>> We can see the same trend as the clang benchmark: the simple heuristic
>> (5) recovers a lot of performance loss compared with FE-based
>> instrumentation, but is still significantly worse than late instrumentation
>> (3).
>>
>> *(2) Performance impact of context sensitivity*
>>
>> LLVM does not use the profile information fully in the back-end
>> optimizations, for instance, inlining does not fully use the profile counts
>> -- it only marks hot/cold function attributes based on function entry
>> counts. To evaluate the impact of profile context sensitivity, GCC is used
>> in the experiment. Note that GCC PGO improves clang performance a lot more
>> than clang PGO.
>>
>> First we summarize the methodology used in the experiment:
>> 0)  build clang with GCC O2 without early inlining and measure clang's
>> performance. GCC early inlining (einline) is similar to pre-inline used by
>> late instrumentation.
>> 1) build clang with GCC O2 with early inlining and measure performance.
>>
>> The performance difference of 1) and 0) is denoted as E which measures
>> the contribution of early inlining.
>>
>> 2) build clang with GCC O2 + PGO without early inlining.
>> 3) build clang with GCC O2 + PGO with early inlining.
>>
>> The performance difference of 3) and 2) is denoted as EC. It consists of
>> roughly two parts: a) the early inlining contribution and b) the context
>> sensitive profiling enabled by early inlining.
>>
>> The contribution of context sensitive profiling can be estimated by EC -
>> E above.
>>
>> -------------------------------------------------------------------------------
>> Config                        wall_time_for_use  speedup_vs_(0)  speedup_vs_(1)
>> (0) base w/o einline             84.946            1.000          0.934
>> (1) base O2                      79.310            1.071          1.000
>> (2) profile-arcs w/o einline     63.518            1.337          1.249
>> (3) profile-arcs                 48.364            1.756          1.640
>>
>> We see the following:
>> 1) GCC PGO with early inlining improves clang performance by 64.0% (v.s.
>> base O2 w/ early inline).
>> 2) GCC PGO w/o early inlining improves clang performance by 33.7% (v.s.
>> base O2 w/o early inline).
>> 3) Early inlining performance contribution is about 7.1%.
>> 4) Profile context sensitivity contribution is estimated to be 23.2%
>> (i.e. 64.0% - 33.7% - 7.1%), which is pretty significant.
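
As a sketch, the estimate in item 4) can be recomputed directly from the wall-time column of the table. The helper name `gain_pct` and the variable names are chosen here for illustration. Computed from the unrounded wall times, the subtraction comes out at roughly 23 percentage points.

```python
# Wall time in seconds for configs (0)-(3), copied from the table above.
wall_time = {0: 84.946, 1: 79.310, 2: 63.518, 3: 48.364}


def gain_pct(baseline: float, improved: float) -> float:
    """Speedup of `improved` over `baseline`, as a percentage gain."""
    return (baseline / improved - 1.0) * 100.0


# 1) PGO with early inlining vs. base O2 with early inlining.
pgo_with_einline = gain_pct(wall_time[1], wall_time[3])
# 2) PGO without early inlining vs. base without early inlining.
pgo_without_einline = gain_pct(wall_time[0], wall_time[2])
# 3) Early inlining alone (no PGO).
einline_only = gain_pct(wall_time[0], wall_time[1])

# 4) What is left after removing the PGO-without-einline gain and the
#    einline-only gain is attributed to context sensitivity.
context_sensitivity = pgo_with_einline - pgo_without_einline - einline_only

if __name__ == "__main__":
    print(f"PGO w/ einline:      {pgo_with_einline:.1f}%")
    print(f"PGO w/o einline:     {pgo_without_einline:.1f}%")
    print(f"einline only:        {einline_only:.1f}%")
    print(f"context sensitivity: {context_sensitivity:.1f} points")
```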
>>
>> *(3) Pre-inline pass impact on the value profiling*
>>
>> Again, we use GCC as the platform to estimate:
>>
>> --------------------------------------------------------
>>   Config                            wall_time_for_instr
>> (2) profile-arcs                      115.720
>> (3) profile-arcs w/o einline          310.560
>> (4) profile-generate                  139.952
>> (5) profile-generate w/o einline      680.910
>>
>> In GCC, -fprofile-generate does -fprofile-arcs as well as the value
>> profiling. The above table shows that with value profiling, the impact of
>> pre-inlining is even larger for instrumented binary performance. Without
>> value profiling, disabling pre-inlining increases runtime by 1.7x, while
>> with value profiling, its impact is a 3.9x increase in runtime.
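
The two factors quoted above read as the relative increase in instrumented run time, i.e. (time without einline / time with einline) - 1. A short sketch that recomputes them from the table (the dictionary keys are labels chosen here):

```python
# Instrumented-binary wall time in seconds, copied from the table above.
wall_time = {
    "arcs": 115.720,
    "arcs_no_einline": 310.560,
    "generate": 139.952,
    "generate_no_einline": 680.910,
}

# Relative increase in run time when early inlining is disabled.
increase_edge_only = wall_time["arcs_no_einline"] / wall_time["arcs"] - 1.0
increase_with_value = (
    wall_time["generate_no_einline"] / wall_time["generate"] - 1.0
)

if __name__ == "__main__":
    print(f"edge profiling only:  +{increase_edge_only:.2f}x")
    print(f"with value profiling: +{increase_with_value:.2f}x")
```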
>>
>>
>> On Tue, Aug 11, 2015 at 10:11 PM, Sean Silva via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>>
>>>
>>> On Tue, Aug 11, 2015 at 11:07 AM, Diego Novillo via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> One aspect of this that I have not seen discussed is that middle-end
>>>> instrumentation enables PGO optimizations for front-ends other than Clang.
>>>>
>>>> While I agree that FE instrumentation could be improved, it still
>>>> requires every FE to implement essentially the same common functionality.
>>>> Having PGO instrumentation generated in the middle-end allows every FE
>>>> to automatically take advantage of PGO.
>>>>
>>>
>>> This is a really good point, and I agree with it. We may have gotten off
>>> on the wrong foot since Rong's email focused so heavily on comparing with
>>> the frontend instrumentation. As far as I see it, Rong's proposal has a
>>> couple different parts:
>>>
>>> 1. Infrastructure for IR-level instrumentation-based PGO
>>> 2. Changes to the pass pipeline so that a hypothetical IR-level
>>> instrumentation-based PGO is more effective
>>> 3. MST algorithm with profile feedback for optimal placement of counter
>>> updates.
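
For readers unfamiliar with item 3, here is a minimal sketch of the classic spanning-tree approach to counter placement, not the code from the proposed patch. It builds a maximum-weight spanning tree over the CFG (treated as undirected) and places counters only on the edges left out of the tree; the remaining counts can then be derived by flow conservation. The function name, the node names, the edge weights, and the artificial `exit -> entry` edge are all assumptions for illustration. The weights stand in for estimated edge frequencies, which could come from a previous profile or from static heuristics.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def choose_instrumented_edges(edges: Dict[Edge, float]) -> List[Edge]:
    """Return the CFG edges that need a counter.

    Kruskal's algorithm, heaviest edges first: an edge that would close a
    cycle is left out of the spanning tree and gets a counter instead.
    """
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    instrumented: List[Edge] = []
    by_weight = sorted(edges.items(), key=lambda kv: kv[1], reverse=True)
    for (src, dst), _weight in by_weight:
        root_src, root_dst = find(src), find(dst)
        if root_src == root_dst:
            instrumented.append((src, dst))  # off-tree edge: needs a counter
        else:
            parent[root_src] = root_dst      # tree edge: count is derivable
    return instrumented


if __name__ == "__main__":
    # Hypothetical diamond CFG. The exit -> entry edge is an artificial
    # edge added so that flow conservation holds at every node.
    cfg = {
        ("exit", "entry"): 100.0,
        ("entry", "then"): 90.0,
        ("then", "exit"): 90.0,
        ("entry", "else"): 10.0,
        ("else", "exit"): 10.0,
    }
    print(choose_instrumented_edges(cfg))
```

With 5 edges and 4 nodes, the tree keeps 3 edges and only 2 need counters, one of them on the cold `else` path.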
>>>
>>> I think 1. is a no-brainer, if only so that all LLVM clients can benefit
>>> from PGO, and also (as you pointed out below) so that it can have an
>>> exclusive focus on performance. If it is sufficiently flexible, it may even
>>> make sense to restrict clang's frontend instrumentation-based profiling to
>>> non-performance stuff, and have clang directly interoperate with the
>>> IR-level PGO for performance-related PGO use cases, just like any other
>>> frontend would.
>>>
>>> Philip and Sanjoy, out of curiosity do you guys use your own
>>> instrumentation placement for PGO? Is an IR-level PGO infrastructure
>>> upstream something you guys would be interested in?
>>>
>>> I think that 2. is something that once we have 1. we will be able to
>>> evaluate better, but for now my opinion is that we should be able to make
>>> good progress without digging into that.
>>>
>>> I think that 3. is a no-brainer if it provides a really significant win,
>>> but without 1. we can't really measure its effect in isolation. It also has
>>> a usability problem since it requires feeding in an existing profile for
>>> the *instrumented* build, but if the benefit is very significant this may
>>> be worth it for some users. We will probably be able to easily refactor 1.
>>> as needed into an MST approach that degrades gracefully to using static
>>> heuristics in the absence of real profile information, so it is not a
>>> maintenance burden (maybe even helps by providing a good framework in which
>>> to develop effective static heuristics).
>>>
>>> For the time being, I think we can avoid discussion of 2. and 3. until
>>> we have more of 1. working. So I think it would be most productive if we
>>> focus this discussion on 1.
>>>
>>>
>>>> Additionally, some of the overhead imposed by FE instrumentation is not
>>>> really all that easy to get rid of.  You end up duplicating functionality
>>>> that is more naturally implemented in the middle end.
>>>>
>>>
>>> Yeah, I was looking into a couple of other simple approaches and quickly
>>> found out that I was basically replicating much of the sort of logic that
>>> the inliner already has.
>>>
>>> -- Sean Silva
>>>
>>>
>>>>
>>>> I see the two approaches as supplementary, rather than complementary.
>>>> One does not negate the other.  Some of the optimizations we'd do in the
>>>> FE, may hurt coverage.  Instead, by instrumenting in the middle end, you
>>>> can focus exclusively on performance (coverage be damned).
>>>>
>>>>
>>>> Diego.
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org         http://llvm.cs.uiuc.edu
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>