[llvm-dev] RFC: PGO Late instrumentation for LLVM

Tue Sep 1 14:48:12 PDT 2015

On Tue, Sep 1, 2015 at 2:21 PM, Rong Xu <xur at google.com> wrote:

>
>
> On Tue, Sep 1, 2015 at 11:47 AM, Sean Silva <chisophugis at gmail.com> wrote:
>
>>
>>
>> On Tue, Sep 1, 2015 at 11:03 AM, Rong Xu <xur at google.com> wrote:
>>
>>> Justin, Sean and other people interested in this proposal,
>>>
>>> I'm wondering whether you have had a chance to read the new experiment
>>> results in my last email, sent 2 weeks ago. Can you share your thoughts,
>>> or do you have other tests that you want to run?
>>>
>>
>> See my email from Aug 11 (3 weeks ago). Adding an IR-level
>> instrumentation pass makes sense (you didn't need to provide any
>> performance data to support this; there are plenty of good reasons), but
>> there are a couple independent parts. Have you been able to work on
>> splitting out any of them?
>>
>
> I re-read your comments from Aug 11.
> >As far as I see it, Rong's proposal has a couple different parts:
> >
> >1. Infrastructure for IR-level instrumentation-based PGO
> >2. Changes to the pass pipeline so that a hypothetical IR-level
> >instrumentation-based PGO is more effective
> >3. MST algorithm with profile feedback for optimal placement of counter
> >updates.
>
> In my implementation, the MST algorithm is the main component of 1.
>

I guess for 3. I mean "profile feedback for optimal placement of counter
updates" since that has significant usability implications (needing to pass
extra files). I have nothing against the MST algorithm per se.
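
To make the distinction concrete, here is a minimal Python sketch of MST-based counter placement in general (the classic spanning-tree technique), not Rong's implementation. The function name `place_counters`, the example CFG, the edge weights, and the virtual exit-to-entry edge are all assumptions chosen for illustration. Edges on a maximum-weight spanning tree get no counter, since their counts can be derived from flow conservation; only the remaining edges are instrumented. The weights may come from static estimates or from an earlier profile, and the latter is the "profile feedback" variant that needs the extra file.

```python
def place_counters(edges, weight):
    """Return the CFG edges that need a counter.

    Builds a maximum-weight spanning tree (Kruskal with union-find),
    treating edges as undirected; every edge left off the tree is
    instrumented.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    instrumented = []
    # Heaviest edges first, so hot edges tend to land on the tree.
    for edge in sorted(edges, key=lambda e: weight[e], reverse=True):
        root_a, root_b = find(edge[0]), find(edge[1])
        if root_a == root_b:
            instrumented.append(edge)   # closes a cycle: needs a counter
        else:
            parent[root_a] = root_b     # joins two components: tree edge

    return instrumented


# Hypothetical diamond-shaped CFG with a virtual exit->entry edge.
edges = [("exit", "entry"), ("entry", "then"), ("then", "exit"),
         ("entry", "else"), ("else", "exit")]
weight = {("exit", "entry"): 100, ("entry", "then"): 90,
          ("then", "exit"): 90, ("entry", "else"): 10,
          ("else", "exit"): 10}

counters = place_counters(edges, weight)
print(counters)
```

With four nodes and five edges, three edges end up on the tree and two carry counters; the other three counts follow from the rule that flow into a block equals flow out of it.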

-- Sean Silva

>   The only IR change is to insert instrprof_increment intrinsic calls
> (which will be lowered in createInstrProfilingPass).
> I'm not quite sure about 3. Do you mean the MST algorithm, or using one
> profile to guide the MST algorithm to get the optimal placement? I do have
> the code for both, but the latter was just for experimental purposes. It
> is going to be hard to use in real applications (for example, the
> profile-use would also need the bootstrap profile to read the real profile).
>
>
>>
>>>
>>> I'm in the final stage of preparing the patch. If that is OK with you, I
>>> can send out the patch soon.
>>>
>>
>> I'm not sure what you mean by "the" patch. It seems pretty clear that
>> there are multiple sub-parts to this. Could you send an RFC for part 1 that
>> I described? We especially need to discuss the interface for frontends e.g.
>> clang command line interface, when a user passes a profile file how do we
>> thread that information back to the middle-end, details for the runtime
>> interoperation (things like function hash will have different meaning
>> between IR-level and Clang instrumentation), etc.
>>
>
> I agree with your approach. When I said "the patch", I really meant 'a
> series of patches'.
>
> Thanks for the suggestion.
>
> -Rong
>
>
>>
>> -- Sean Silva
>>
>>
>>>
>>> Thanks,
>>>
>>> -Rong
>>>
>>> On Wed, Aug 19, 2015 at 5:18 PM, Philip Reames <
>>> listmail at philipreames.com> wrote:
>>>
>>>> Thank you for sharing the data.  I haven't been following the
>>>> discussion, but this data made for very interesting reading on its own.
>>>>
>>>> Philip
>>>>
>>>>
>>>> On 08/19/2015 03:39 PM, Rong Xu via llvm-dev wrote:
>>>>
>>>> We collected more data to address some of the questions from the
>>>> reviewers. Note that this time we use clang itself as the benchmark. We
>>>> chose clang because we think it's a typical C++ program and the
>>>> reviewers here have good knowledge of the code base.
>>>>
>>>> What we measure is running time for clang to compile a large
>>>> preprocessed source file (4.98M lines of .ii file), using different
>>>> compilation modes. All the numbers reported here are the average running
>>>> time of 5 runs in seconds.
>>>>
>>>> *(1) Performance of late instrumentation vs. not instrumenting single
>>>> BB functions*
>>>>
>>>> We first compare the performance of the various instrumentation modes.
>>>>
>>>> ----------------------------------------------------------------------------
>>>>   Config                 wall_time_for_instr  ratio_vs_base  profile_size
>>>> (1) base O2                    80.386            100.0%            --
>>>> (2) FE-based Instr            201.658            250.8%        65238880
>>>> (3) late Instr                103.662            129.0%        14860144
>>>> (4) (3) + w/o pre-inline      199.924            248.7%        70762720
>>>> (5) (4) + Silva               119.904            149.2%        24499528
>>>>
>>>> Config (5) used the simple heuristic that Sean Silva proposed: not
>>>> instrumenting single BB functions that contain fewer than 10 instructions
>>>> (excluding debug and phi stmts).
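
As an illustration only, the heuristic described for config (5) can be sketched in Python over a toy function representation. The data model (a function as a list of basic blocks, each a list of opcode strings), the opcode names `"phi"` and `"dbg"`, and the helper name `should_instrument` are assumptions made for this sketch; they are not LLVM APIs or the code used in the experiment.

```python
SKIPPED_OPCODES = {"phi", "dbg"}  # hypothetical names for phi and debug stmts


def should_instrument(blocks, threshold=10):
    """Return False for single-BB functions with fewer than `threshold`
    real instructions; every other function is instrumented."""
    if len(blocks) != 1:
        return True  # multi-block functions are always instrumented
    real = [op for op in blocks[0] if op not in SKIPPED_OPCODES]
    return len(real) >= threshold


tiny_getter = [["load", "ret"]]
padded_with_debug = [["add"] * 9 + ["dbg"] * 3]
ten_real_instructions = [["add"] * 9 + ["ret"]]
two_blocks = [["br"], ["ret"]]

print(should_instrument(tiny_getter))
print(should_instrument(two_blocks))
```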
>>>>
>>>> We can see:
>>>> 1) The simple heuristic of not instrumenting small single BB functions
>>>> improves instrumentation performance, as expected.
>>>> 2) The simple heuristic is still slower than late instrumentation
>>>> with pre-inlining: the latter is about 15% faster.
>>>> 3) Late instrumentation produces the smallest profile size: it is 39%
>>>> smaller than with the simple heuristic.
>>>>
>>>> The result is expected, as pre-inlining can handle more cases than the
>>>> simple heuristic. There is a significant performance gap between the
>>>> simple heuristic (5) and late instrumentation (3).
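
The two comparisons between the simple heuristic and late instrumentation can be rechecked from the table's wall times and profile sizes. This is a small sketch; the variable names are made up for illustration, and the numbers are copied from the table above.

```python
# Wall time (seconds) and profile size from the instrumentation table.
late_instr_time, late_instr_size = 103.662, 14860144   # config (3)
heuristic_time, heuristic_size = 119.904, 24499528     # config (5)

# Extra running time of the heuristic build relative to late instrumentation.
extra_time = heuristic_time / late_instr_time - 1.0

# Profile size reduction of late instrumentation relative to the heuristic.
size_reduction = 1.0 - late_instr_size / heuristic_size

print(f"heuristic is {extra_time:.1%} slower than late instrumentation")
print(f"late instrumentation profile is {size_reduction:.1%} smaller")
```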
>>>>
>>>> We also used a few larger internal benchmarks to further validate the
>>>> above result. The following table shows the slowdown compared to the base
>>>> O2. The labels (2) to (5) refer to the same config as in the previous table.
>>>> ------------------------------------------------------
>>>> Program                (2)      (3)      (4)      (5)
>>>> C++benchmark16      -45.24%  -12.93%  -43.84%  -24.74%
>>>> C++benchmark17      -90.86%  -58.19%  -87.77%  -80.62%
>>>> C++benchmark18      -95.32%  -54.75%  -91.21%  -82.56%
>>>>
>>>>
>>>> We can see the same trend as with the clang benchmark: the simple
>>>> heuristic (5) recovers much of the performance lost to FE-based
>>>> instrumentation, but is still significantly worse than late
>>>> instrumentation (3).
>>>>
>>>> *(2) Performance impact of context sensitivity*
>>>>
>>>> LLVM does not use the profile information fully in the back-end
>>>> optimizations, for instance, inlining does not fully use the profile counts
>>>> -- it only marks hot/cold function attribute based on function entry
>>>> counts. To evaluate the impact of profile context sensitivity, GCC is used
>>>> in the experiment. Note that GCC PGO improves clang performance a lot more
>>>> than clang PGO.
>>>>
>>>> First we summarize the methodology used in the experiment:
>>>> 0)  build clang with GCC O2 without early inlining and measure clang's
>>>> performance. GCC early inlining (einline) is similar to pre-inline used by
>>>> late instrumentation.
>>>> 1) build clang with GCC O2 with early inlining and measure performance.
>>>>
>>>> The performance difference of 1) and 0) is denoted as E which measures
>>>> the contribution of early inlining.
>>>>
>>>> 2) build clang with GCC O2 + PGO without early inlining.
>>>> 3) build clang with GCC O2 + PGO with early inlining.
>>>>
>>>> The performance difference of 3) and 2) is denoted as EC. It
>>>> consists of roughly two parts: a) the early inlining contribution and
>>>> b) context-sensitive profiling enabled by early inlining.
>>>>
>>>> The contribution of context sensitive profiling can be estimated by EC
>>>> - E above.
>>>>
>>>> -------------------------------------------------------------------------------
>>>> Config                        wall_time_for_use  speedup_vs_(0)  speedup_vs_(1)
>>>> (0) base w/o einline               84.946            1.000           0.934
>>>> (1) base O2                        79.310            1.071           1.000
>>>> (2) profile-arcs w/o einline       63.518            1.337           1.249
>>>> (3) profile-arcs                   48.364            1.756           1.640
>>>>
>>>> We see the following:
>>>> 1) GCC PGO with early inlining improves clang performance by 64.0%
>>>> (v.s. base O2 w/ early inline).
>>>> 2) GCC PGO w/o early inlining improves clang performance by 33.7% (v.s.
>>>> base O2 w/o early inline).
>>>> 3) Early inlining performance contribution is about 7.1%.
>>>> 4) Profile context sensitivity contribution is estimated to be 23.2%
>>>> (i.e. 64.0% - 33.7% - 7.1%), which is pretty significant.
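
The estimate in 4) can be recomputed directly from the wall times in the table. The sketch below uses illustrative variable names and the subtraction described in the methodology (PGO gain with early inlining, minus PGO gain without it, minus the early inlining contribution E); with unrounded inputs it comes to roughly 23 percentage points.

```python
# Wall times (seconds) from the GCC table above.
base_no_einline = 84.946   # config (0)
base_o2 = 79.310           # config (1)
pgo_no_einline = 63.518    # config (2)
pgo_einline = 48.364       # config (3)

# Speedups expressed as fractional improvements.
early_inline_gain = base_no_einline / base_o2 - 1.0             # E
pgo_gain_no_einline = base_no_einline / pgo_no_einline - 1.0    # (2) vs (0)
pgo_gain_einline = base_o2 / pgo_einline - 1.0                  # (3) vs (1)

# Estimated contribution of profile context sensitivity.
context_gain = pgo_gain_einline - pgo_gain_no_einline - early_inline_gain

print(f"early inlining:      {early_inline_gain:.1%}")
print(f"PGO w/o einline:     {pgo_gain_no_einline:.1%}")
print(f"PGO with einline:    {pgo_gain_einline:.1%}")
print(f"context sensitivity: {context_gain:.1%}")
```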
>>>>
>>>> *(3) Pre-inline pass impact on the value profiling*
>>>>
>>>> Again, we use GCC as the platform to estimate the impact:
>>>>
>>>> --------------------------------------------------------
>>>>   Config                            wall_time_for_instr
>>>> (2) profile-arcs                      115.720
>>>> (3) profile-arcs w/o einline          310.560
>>>> (4) profile-generate                  139.952
>>>> (5) profile-generate w/o einline      680.910
>>>>
>>>> In GCC, -fprofile-generate does -fprofile-arcs as well as value
>>>> profiling. The above table shows that with value profiling, the impact
>>>> of pre-inlining on instrumented binary performance is even larger.
>>>> Without value profiling, disabling pre-inlining increases runtime by
>>>> 1.7x (115.7 s to 310.6 s), while with value profiling the increase is
>>>> 3.9x (140.0 s to 680.9 s).
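
For reference, the two figures above are increases relative to the pre-inlined build, not total ratios. A short sketch (variable names are illustrative) that derives both from the table:

```python
# Instrumented-binary wall times (seconds) from the table above.
arcs, arcs_no_einline = 115.720, 310.560
generate, generate_no_einline = 139.952, 680.910

# Total slowdown ratio when early inlining is disabled.
arcs_ratio = arcs_no_einline / arcs
generate_ratio = generate_no_einline / generate

# Increase in runtime, i.e. the ratio minus the original 1x.
arcs_increase = arcs_ratio - 1.0
generate_increase = generate_ratio - 1.0

print(f"profile-arcs:     {arcs_ratio:.2f}x total, +{arcs_increase:.2f}x")
print(f"profile-generate: {generate_ratio:.2f}x total, +{generate_increase:.2f}x")
```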
>>>>
>>>>
>>>> On Tue, Aug 11, 2015 at 10:11 PM, Sean Silva via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 11, 2015 at 11:07 AM, Diego Novillo via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> One aspect of this that I have not seen discussed is that middle-end
>>>>>> instrumentation enables PGO optimizations for front-ends other than Clang.
>>>>>>
>>>>>> While I agree that FE instrumentation could be improved, it still
>>>>>> requires every FE to implement essentially the same common functionality.
>>>>>> Having PGO instrumentation generated in the middle-end allows every FE
>>>>>> to automatically take advantage of PGO.
>>>>>>
>>>>>
>>>>> This is a really good point, and I agree with it. We may have gotten
>>>>> off on the wrong foot since Rong's email focused so heavily on comparing
>>>>> with the frontend instrumentation. As far as I see it, Rong's proposal has
>>>>> a couple different parts:
>>>>>
>>>>> 1. Infrastructure for IR-level instrumentation-based PGO
>>>>> 2. Changes to the pass pipeline so that a hypothetical IR-level
>>>>> instrumentation-based PGO is more effective
>>>>> 3. MST algorithm with profile feedback for optimal placement of
>>>>> counter updates.
>>>>>
>>>>> I think 1. is a no-brainer, if only so that all LLVM clients can
>>>>> benefit from PGO, and also (as you pointed out below) so that it can have
>>>>> an exclusive focus on performance. If it is sufficiently flexible, it may
>>>>> even make sense to restrict clang's frontend instrumentation-based
>>>>> profiling to non-performance stuff, and have clang directly interoperate
>>>>> with the IR-level PGO for performance-related PGO use cases, just like any
>>>>> other frontend would.
>>>>>
>>>>> Philip and Sanjoy, out of curiosity do you guys use your own
>>>>> instrumentation placement for PGO? Is an IR-level PGO infrastructure
>>>>> upstream something you guys would be interested in?
>>>>>
>>>>> I think that 2. is something that once we have 1. we will be able to
>>>>> evaluate better, but for now my opinion is that we should be able to make
>>>>> good progress without digging into that.
>>>>>
>>>>> I think that 3. is a no-brainer if it provides a really significant
>>>>> win, but without 1. we can't really measure its effect in isolation. It
>>>>> also has a usability problem since it requires feeding in an existing
>>>>> profile for the *instrumented* build, but if the benefit is very
>>>>> significant this may be worth it for some users. We will probably be able
>>>>> to easily refactor 1. as needed into an MST approach that degrades
>>>>> gracefully to using static heuristics in the absence of real profile
>>>>> information, so is not a maintenance burden (maybe even helps by providing
>>>>> a good framework in which to develop effective static heuristics).
>>>>>
>>>>> For the time being, I think we can avoid discussion of 2. and 3. until
>>>>> we have more of 1. working. So I think it would be most productive if we
>>>>> focus this discussion on 1.
>>>>>
>>>>>
>>>>>> Additionally, some of the overhead imposed by FE instrumentation is
>>>>>> not really all that easy to get rid of.  You end up duplicating
>>>>>> functionality that is more naturally implemented in the middle end.
>>>>>>
>>>>>
>>>>> Yeah, I was looking into a couple of other simple approaches and
>>>>> quickly found out that I was basically replicating much of the sort of
>>>>> logic that the inliner already has.
>>>>>
>>>>> -- Sean Silva
>>>>>
>>>>>
>>>>>>
>>>>>> I see the two approaches as supplementary, rather than
>>>>>> complementary.  One does not negate the other.  Some of the optimizations
>>>>>> we'd do in the FE, may hurt coverage.  Instead, by instrumenting in the
>>>>>> middle end, you can focus exclusively on performance (coverage be damned).
>>>>>>
>>>>>>
>>>>>> Diego.
>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org         http://llvm.cs.uiuc.edu
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>