[llvm-dev] RFC: PGO Late instrumentation for LLVM
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Wed Aug 19 17:18:18 PDT 2015
Thank you for sharing the data. I haven't been following the
discussion, but this data made for very interesting reading on its own.
Philip
On 08/19/2015 03:39 PM, Rong Xu via llvm-dev wrote:
> We collected more data to address some of the questions from the
> reviewers. Note that this time we use clang itself as the benchmark. We
> chose clang because it is a typical C++ program and the reviewers here
> know the code base well.
>
> What we measure is running time for clang to compile a large
> preprocessed source file (4.98M lines of .ii file), using different
> compilation modes. All the numbers reported here are the average
> running time of 5 runs in seconds.
>
> *(1) Performance comparison between late instrumentation and not
> instrumenting single-BB functions*
>
> We first compare instrumentation performance across configurations.
> ----------------------------------------------------------------------------
> Config wall_time_for_instr ratio_vs_base profile_size
> (1) base O2 80.386 100.0% --
> (2) FE-based Instr 201.658 250.8% 65238880
> (3) late Instr 103.662 129.0% 14860144
> (4) (3) + w/o pre-inline 199.924 248.7% 70762720
> (5) (4) + Silva 119.904 149.2% 24499528
>
> Config (5) uses the simple heuristic that Sean Silva proposed: do not
> instrument single-BB functions that contain fewer than 10
> instructions (excluding debug and phi statements).
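The config (5) heuristic can be sketched roughly as follows. The `Function` and `Instruction` classes and the opcode names are illustrative stand-ins, not LLVM's actual C++ API; only the rule itself (skip single-BB functions with fewer than 10 non-debug, non-phi instructions) comes from the text above.

```python
# Toy model of the config (5) heuristic: skip profile counters for
# single-basic-block functions with fewer than 10 "real" instructions
# (debug intrinsics and phis excluded). Illustrative classes, not LLVM API.

class Instruction:
    def __init__(self, opcode):
        self.opcode = opcode

class Function:
    def __init__(self, name, blocks):
        self.name = name
        self.blocks = blocks  # list of basic blocks, each a list of Instructions

# Instructions that do not count toward the size threshold.
SKIPPED_OPCODES = {"phi", "llvm.dbg.value", "llvm.dbg.declare"}

def should_instrument(func, threshold=10):
    """Return False only for small single-BB functions."""
    if len(func.blocks) != 1:
        return True  # multi-block functions are always instrumented
    real = [i for i in func.blocks[0] if i.opcode not in SKIPPED_OPCODES]
    return len(real) >= threshold

# A tiny accessor gets skipped; a long straight-line function is kept.
tiny = Function("get_x", [[Instruction("load"), Instruction("ret")]])
long_bb = Function("unrolled", [[Instruction("store")] * 12])
print(should_instrument(tiny), should_instrument(long_bb))  # False True
```

This captures why the heuristic helps: small accessors dominate counter traffic in FE-based instrumentation, but it cannot recover the cross-function effects that pre-inlining handles.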
>
> We can see:
> 1) The simple heuristic of not instrumenting small single-BB functions
> improves instrumentation performance as expected.
> 2) The simple heuristic is still slower than late instrumentation
> with pre-inlining: the latter is 15% faster.
> 3) Late instrumentation produces the smallest profile: it's 39%
> smaller than with the simple heuristic.
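Observations 2) and 3) can be checked directly against the numbers in the table above:

```python
# Arithmetic check against configs (3) and (5) in the table above.
late_time, heuristic_time = 103.662, 119.904    # wall times, seconds
late_size, heuristic_size = 14860144, 24499528  # profile sizes, bytes

slowdown = heuristic_time / late_time - 1.0       # heuristic vs. late instr
size_saving = 1.0 - late_size / heuristic_size    # profile size reduction

print(f"heuristic is {slowdown:.1%} slower than late instrumentation")
print(f"late-instrumentation profile is {size_saving:.1%} smaller")
```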
>
> The result is expected, as pre-inlining handles more cases than the
> simple heuristic. There remains a significant performance gap between
> the simple heuristic (5) and late instrumentation (3).
>
> We also used a few larger internal benchmarks to further validate the
> above result. The following table shows the slowdown compared to the
> base O2. The labels (2) to (5) refer to the same config as in the
> previous table.
> ------------------------------------------------------
> Program (2) (3) (4) (5)
> C++benchmark16 -45.24% -12.93% -43.84% -24.74%
> C++benchmark17 -90.86% -58.19% -87.77% -80.62%
> C++benchmark18 -95.32% -54.75% -91.21% -82.56%
>
>
> We can see the same trend as in the clang benchmark: the simple heuristic
> (5) recovers much of the performance loss compared with FE-based
> instrumentation, but is still significantly worse than late
> instrumentation (3).
>
> *(2) Performance impact of context sensitivity*
>
> LLVM does not yet fully use profile information in the back-end
> optimizations. For instance, inlining does not fully use the profile
> counts; it only sets hot/cold function attributes based on function
> entry counts. To evaluate the impact of profile context sensitivity,
> we used GCC in this experiment. Note that GCC PGO improves clang
> performance much more than clang PGO does.
>
> First we summarize the methodology used in the experiment:
> 0) build clang with GCC O2 without early inlining and measure clang's
> performance. GCC early inlining (einline) is similar to pre-inline
> used by late instrumentation.
> 1) build clang with GCC O2 with early inlining and measure performance.
>
> The performance difference between 1) and 0) is denoted E; it measures
> the contribution of early inlining.
>
> 2) build clang with GCC O2 + PGO without early inlining.
> 3) build clang with GCC O2 + PGO with early inlining.
>
> The performance difference between 3) and 2) is denoted EC. It
> consists of roughly two parts: a) the early inlining contribution, and
> b) the context-sensitive profiling enabled by early inlining.
>
> The contribution of context-sensitive profiling can thus be estimated
> as EC - E.
> -------------------------------------------------------------------------------
> Config wall_time_for_use speedup_vs_(0) speedup_vs_(1)
> (0) base w/o einline 84.946 1.000 0.934
> (1) base O2 79.310 1.071 1.000
> (2) profile-arcs w/o einline 63.518 1.337 1.249
> (3) profile-arcs 48.364 1.756 1.640
>
> We see the following:
> 1) GCC PGO with early inlining improves clang performance by 64.0%
> (vs. base O2 w/ early inline).
> 2) GCC PGO w/o early inlining improves clang performance by 33.7%
> (vs. base O2 w/o early inline).
> 3) Early inlining performance contribution is about 7.1%.
> 4) Profile context sensitivity contribution is estimated to be 23.2%
> (i.e. 64.0% - 33.7% - 7.1%), which is pretty significant.
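The speedup columns and the EC - E estimate follow directly from the wall times in the table; a quick arithmetic check:

```python
# Reproducing the speedups from the wall times (seconds) in the GCC table.
t0, t1 = 84.946, 79.310   # (0) base w/o einline, (1) base O2
t2, t3 = 63.518, 48.364   # (2) PGO w/o einline,  (3) PGO w/ einline

E = t0 / t1 - 1.0          # early-inline contribution, ~7.1%
pgo_vs_1 = t1 / t3 - 1.0   # (3) vs (1), ~64.0%
pgo_vs_0 = t0 / t2 - 1.0   # (2) vs (0), ~33.7%

# Rough context-sensitivity estimate per the EC - E methodology above.
context = pgo_vs_1 - pgo_vs_0 - E

print(f"E={E:.1%}  PGO/base={pgo_vs_1:.1%}  "
      f"PGO w/o einline={pgo_vs_0:.1%}  context={context:.1%}")
```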
>
> *(3) Pre-inline pass impact on the value profiling*
>
> Again, we use GCC as the platform to estimate:
>
> --------------------------------------------------------
> Config wall_time_for_instr
> (2) profile-arcs 115.720
> (3) profile-arcs w/o einline 310.560
> (4) profile-generate 139.952
> (5) profile-generate w/o einline 680.910
>
> In GCC, -fprofile-generate does -fprofile-arcs plus value profiling.
> The above table shows that with value profiling, the impact of
> pre-inlining on instrumented binary performance is even larger:
> without value profiling, disabling pre-inlining increases runtime by
> 1.7x, while with value profiling it increases runtime by 3.9x.
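The 1.7x and 3.9x figures can be reproduced from the wall times in the table above:

```python
# Runtime increase from disabling pre-inlining (einline), GCC table above.
arcs, arcs_no_ei = 115.720, 310.560  # -fprofile-arcs
gen, gen_no_ei = 139.952, 680.910    # -fprofile-generate (adds value profiling)

arcs_increase = arcs_no_ei / arcs - 1.0  # extra runtime, edge counters only
gen_increase = gen_no_ei / gen - 1.0     # extra runtime, with value profiling

print(f"w/o value profiling: +{arcs_increase:.1f}x runtime")
print(f"with value profiling: +{gen_increase:.1f}x runtime")
```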
>
>
> On Tue, Aug 11, 2015 at 10:11 PM, Sean Silva via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
>
> On Tue, Aug 11, 2015 at 11:07 AM, Diego Novillo via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> One aspect of this that I have not seen discussed is that
> middle-end instrumentation makes PGO available to
> front-ends other than Clang.
>
> While I agree that FE instrumentation could be improved, it
> still requires every FE to implement essentially the same
> common functionality. Having PGO instrumentation generated in
> the middle end allows every FE to automatically take
> advantage of PGO.
>
>
> This is a really good point, and I agree with it. We may have
> gotten off on the wrong foot since Rong's email focused so heavily
> on comparing with the frontend instrumentation. As far as I see
> it, Rong's proposal has a couple different parts:
>
> 1. Infrastructure for IR-level instrumentation-based PGO
> 2. Changes to the pass pipeline so that a hypothetical IR-level
> instrumentation-based PGO is more effective
> 3. MST algorithm with profile feedback for optimal placement of
> counter updates.
>
> I think 1. is a no-brainer, if only so that all LLVM clients can
> benefit from PGO, and also (as you pointed out below) so that it
> can have an exclusive focus on performance. If it is sufficiently
> flexible, it may even make sense to restrict clang's frontend
> instrumentation-based profiling to non-performance stuff, and have
> clang directly interoperate with the IR-level PGO for
> performance-related PGO use cases, just like any other frontend would.
>
> Philip and Sanjoy, out of curiosity do you guys use your own
> instrumentation placement for PGO? Is an IR-level PGO
> infrastructure upstream something you guys would be interested in?
>
> I think that 2. is something that once we have 1. we will be able
> to evaluate better, but for now my opinion is that we should be
> able to make good progress without digging into that.
>
> I think that 3. is a no-brainer if it provides a really
> significant win, but without 1. we can't really measure its effect
> in isolation. It also has a usability problem since it requires
> feeding in an existing profile for the *instrumented* build, but
> if the benefit is very significant this may be worth it for some
> users. We will probably be able to easily refactor 1. as needed
> into an MST approach that degrades gracefully to using static
> heuristics in the absence of real profile information, so it is not a
> maintenance burden (it may even help by providing a good framework
> in which to develop effective static heuristics).
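For reference, the spanning-tree counter-placement idea in item 3 can be sketched as follows. This is the classic optimal edge-profiling result (due to Knuth; used by gcov-style instrumentation): place counters only on edges *outside* a maximum-weight spanning tree of the CFG, since tree-edge counts are then recoverable by flow conservation. The CFG, edge weights, and function names below are made up for illustration.

```python
# Maximum-weight spanning tree via Kruskal (heaviest edges first).
# Edges are (weight, src, dst); weights come from static heuristics or,
# in the MST-with-feedback proposal, from a prior profile.
def max_spanning_tree(nodes, edges):
    parent = {n: n for n in nodes}

    def find(n):  # union-find with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components: keep it in the tree
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Toy diamond CFG: hot path entry->B1->exit, cold path entry->B2->exit.
nodes = ["entry", "B1", "B2", "exit"]
edges = [(90, "entry", "B1"), (10, "entry", "B2"),
         (90, "B1", "exit"), (10, "B2", "exit")]

tree = max_spanning_tree(nodes, edges)
# Only off-tree edges need counters; here that is a single cold edge
# instead of all four. Tree-edge counts follow from flow conservation.
instrumented = [e for e in edges if e not in tree]
print(instrumented)
```

The profile feedback matters because the heaviest edges end up on the tree, so the counters that do get emitted sit on the coldest paths.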
>
> For the time being, I think we can avoid discussion of 2. and 3.
> until we have more of 1. working. So I think it would be most
> productive if we focus this discussion on 1.
>
>
> Additionally, some of the overhead imposed by FE
> instrumentation is not really all that easy to get rid of.
> You end up duplicating functionality that is more naturally
> implemented in the middle end.
>
>
> Yeah, I was looking into a couple of other simple approaches and
> quickly found out that I was basically replicating much of the
> sort of logic that the inliner already has.
>
> -- Sean Silva
>
>
> I see the two approaches as complementary rather than mutually
> exclusive. One does not negate the other. Some of the
> optimizations we'd do in the FE may hurt coverage. Instead,
> by instrumenting in the middle end, you can focus exclusively
> on performance (coverage be damned).
>
>
> Diego.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://llvm.cs.uiuc.edu
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>