[llvm-dev] RFC: PGO Late instrumentation for LLVM
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Wed Aug 19 17:18:18 PDT 2015
Thank you for sharing the data. I haven't been following the
discussion, but this data made for very interesting reading on its own.
Philip
On 08/19/2015 03:39 PM, Rong Xu via llvm-dev wrote:
> We collected more data to address some of the questions from the
> reviewers. Note that this time we use clang itself as the benchmark. We
> chose clang because it is a typical C++ program and the reviewers here
> know the code base well.
>
> What we measure is running time for clang to compile a large
> preprocessed source file (4.98M lines of .ii file), using different
> compilation modes. All the numbers reported here are the average
> running time of 5 runs in seconds.
>
> *(1) Performance comparison between late instrumentation and not
> instrumenting single-BB functions*
>
> We first compare instrumentation performance across configurations.
> ----------------------------------------------------------------------------
> Config wall_time_for_instr ratio_vs_base profile_size
> (1) base O2 80.386 100.0% --
> (2) FE-based Instr 201.658 250.8% 65238880
> (3) late Instr 103.662 129.0% 14860144
> (4) (3) + w/o pre-inline 199.924 248.7% 70762720
> (5) (4) + Silva 119.904 149.2% 24499528
>
> Config (5) uses the simple heuristic that Sean Silva proposed: do not
> instrument single-BB functions that contain fewer than 10
> instructions (excluding debug and phi statements).
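The config (5) heuristic can be sketched roughly as follows. The `Function` and `Instruction` classes and the opcode names are illustrative stand-ins, not LLVM's actual C++ API; only the rule itself (skip single-BB functions with fewer than 10 non-debug, non-phi instructions) comes from the text above.

```python
# Toy model of the config (5) heuristic: skip profile counters for
# single-basic-block functions with fewer than 10 "real" instructions
# (debug intrinsics and phis excluded). Illustrative classes, not LLVM API.

class Instruction:
    def __init__(self, opcode):
        self.opcode = opcode

class Function:
    def __init__(self, name, blocks):
        self.name = name
        self.blocks = blocks  # list of basic blocks, each a list of Instructions

# Instructions that do not count toward the size threshold.
SKIPPED_OPCODES = {"phi", "llvm.dbg.value", "llvm.dbg.declare"}

def should_instrument(func, threshold=10):
    """Return False only for small single-BB functions."""
    if len(func.blocks) != 1:
        return True  # multi-block functions are always instrumented
    real = [i for i in func.blocks[0] if i.opcode not in SKIPPED_OPCODES]
    return len(real) >= threshold

# A tiny accessor gets skipped; a long straight-line function is kept.
tiny = Function("get_x", [[Instruction("load"), Instruction("ret")]])
long_bb = Function("unrolled", [[Instruction("store")] * 12])
print(should_instrument(tiny), should_instrument(long_bb))  # False True
```

This captures why the heuristic helps: small accessors dominate counter traffic in FE-based instrumentation, but it cannot recover the cross-function effects that pre-inlining handles.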
>
> We can see:
> 1) The simple heuristic of not instrumenting small single-BB functions
> improves instrumentation performance as expected.
> 2) The simple heuristic is still slower than late instrumentation
> with pre-inlining: the latter is 15% faster.
> 3) Late instrumentation produces the smallest profile: it's 39%
> smaller than with the simple heuristic.
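Observations 2) and 3) can be checked directly against the numbers in the table above:

```python
# Arithmetic check against configs (3) and (5) in the table above.
late_time, heuristic_time = 103.662, 119.904    # wall times, seconds
late_size, heuristic_size = 14860144, 24499528  # profile sizes, bytes

slowdown = heuristic_time / late_time - 1.0       # heuristic vs. late instr
size_saving = 1.0 - late_size / heuristic_size    # profile size reduction

print(f"heuristic is {slowdown:.1%} slower than late instrumentation")
print(f"late-instrumentation profile is {size_saving:.1%} smaller")
```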
>
> The result is expected, as pre-inlining handles more cases than the
> simple heuristic. There remains a significant performance gap between
> the simple heuristic (5) and late instrumentation (3).
>
> We also used a few larger internal benchmarks to further validate the
> above result. The following table shows the slowdown compared to the
> base O2. The labels (2) to (5) refer to the same config as in the
> previous table.
> ------------------------------------------------------
> Program (2) (3) (4) (5)
> C++benchmark16 -45.24% -12.93% -43.84% -24.74%
> C++benchmark17 -90.86% -58.19% -87.77% -80.62%
> C++benchmark18 -95.32% -54.75% -91.21% -82.56%
>
>
> We can see the same trend as in the clang benchmark: the simple heuristic
> (5) recovers much of the performance loss compared with FE-based
> instrumentation, but is still significantly worse than late
> instrumentation (3).
>
> *(2) Performance impact of context sensitivity*
>
> LLVM does not yet fully use profile information in the back-end
> optimizations. For instance, inlining does not fully use the profile
> counts; it only sets hot/cold function attributes based on function
> entry counts. To evaluate the impact of profile context sensitivity,
> we used GCC in this experiment. Note that GCC PGO improves clang
> performance much more than clang PGO does.
>
> First we summarize the methodology used in the experiment:
> 0) build clang with GCC O2 without early inlining and measure clang's
> performance. GCC early inlining (einline) is similar to pre-inline
> used by late instrumentation.
> 1) build clang with GCC O2 with early inlining and measure performance.
>
> The performance difference between 1) and 0) is denoted E; it measures
> the contribution of early inlining.
>
> 2) build clang with GCC O2 + PGO without early inlining.
> 3) build clang with GCC O2 + PGO with early inlining.
>
> The performance difference between 3) and 2) is denoted EC. It
> consists of roughly two parts: a) the early inlining contribution, and
> b) the context-sensitive profiling enabled by early inlining.
>
> The contribution of context-sensitive profiling can thus be estimated
> as EC - E.
> -------------------------------------------------------------------------------
> Config wall_time_for_use speedup_vs_(0) speedup_vs_(1)
> (0) base w/o einline 84.946 1.000 0.934
> (1) base O2 79.310 1.071 1.000
> (2) profile-arcs w/o einline 63.518 1.337 1.249
> (3) profile-arcs 48.364 1.756 1.640
>
> We see the following:
> 1) GCC PGO with early inlining improves clang performance by 64.0%
> (vs. base O2 w/ early inline).
> 2) GCC PGO w/o early inlining improves clang performance by 33.7%
> (vs. base O2 w/o early inline).
> 3) Early inlining performance contribution is about 7.1%.
> 4) Profile context sensitivity contribution is estimated to be 23.2%
> (i.e. 64.0% - 33.7% - 7.1%), which is pretty significant.
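The speedup columns and the EC - E estimate follow directly from the wall times in the table; a quick arithmetic check:

```python
# Reproducing the speedups from the wall times (seconds) in the GCC table.
t0, t1 = 84.946, 79.310   # (0) base w/o einline, (1) base O2
t2, t3 = 63.518, 48.364   # (2) PGO w/o einline,  (3) PGO w/ einline

E = t0 / t1 - 1.0          # early-inline contribution, ~7.1%
pgo_vs_1 = t1 / t3 - 1.0   # (3) vs (1), ~64.0%
pgo_vs_0 = t0 / t2 - 1.0   # (2) vs (0), ~33.7%

# Rough context-sensitivity estimate per the EC - E methodology above.
context = pgo_vs_1 - pgo_vs_0 - E

print(f"E={E:.1%}  PGO/base={pgo_vs_1:.1%}  "
      f"PGO w/o einline={pgo_vs_0:.1%}  context={context:.1%}")
```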
>
> *(3) Pre-inline pass impact on the value profiling*
>
> Again, we use GCC as the platform to estimate:
>
> --------------------------------------------------------
> Config wall_time_for_instr
> (2) profile-arcs 115.720
> (3) profile-arcs w/o einline 310.560
> (4) profile-generate 139.952
> (5) profile-generate w/o einline 680.910
>
> In GCC, -fprofile-generate does -fprofile-arcs plus value profiling.
> The above table shows that with value profiling, the impact of
> pre-inlining on instrumented binary performance is even larger:
> without value profiling, disabling pre-inlining increases runtime by
> 1.7x, while with value profiling it increases runtime by 3.9x.
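The 1.7x and 3.9x figures can be reproduced from the wall times in the table above:

```python
# Runtime increase from disabling pre-inlining (einline), GCC table above.
arcs, arcs_no_ei = 115.720, 310.560  # -fprofile-arcs
gen, gen_no_ei = 139.952, 680.910    # -fprofile-generate (adds value profiling)

arcs_increase = arcs_no_ei / arcs - 1.0  # extra runtime, edge counters only
gen_increase = gen_no_ei / gen - 1.0     # extra runtime, with value profiling

print(f"w/o value profiling: +{arcs_increase:.1f}x runtime")
print(f"with value profiling: +{gen_increase:.1f}x runtime")
```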
>
>
> On Tue, Aug 11, 2015 at 10:11 PM, Sean Silva via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
>
> On Tue, Aug 11, 2015 at 11:07 AM, Diego Novillo via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> One aspect of this that I have not seen discussed is that
> middle-end instrumentation makes PGO available to
> front-ends other than Clang.
>
> While I agree that FE instrumentation could be improved, it
> still requires every FE to implement essentially the same
> common functionality. Having PGO instrumentation generated in
> the middle end allows every FE to automatically take
> advantage of PGO.
>
>
> This is a really good point, and I agree with it. We may have
> gotten off on the wrong foot since Rong's email focused so heavily
> on comparing with the frontend instrumentation. As far as I see
> it, Rong's proposal has a couple different parts:
>
> 1. Infrastructure for IR-level instrumentation-based PGO
> 2. Changes to the pass pipeline so that a hypothetical IR-level
> instrumentation-based PGO is more effective
> 3. MST algorithm with profile feedback for optimal placement of
> counter updates.
>
> I think 1. is a no-brainer, if only so that all LLVM clients can
> benefit from PGO, and also (as you pointed out below) so that it
> can have an exclusive focus on performance. If it is sufficiently
> flexible, it may even make sense to restrict clang's frontend
> instrumentation-based profiling to non-performance stuff, and have
> clang directly interoperate with the IR-level PGO for
> performance-related PGO use cases, just like any other frontend would.
>
> Philip and Sanjoy, out of curiosity do you guys use your own
> instrumentation placement for PGO? Is an IR-level PGO
> infrastructure upstream something you guys would be interested in?
>
> I think that 2. is something that once we have 1. we will be able
> to evaluate better, but for now my opinion is that we should be
> able to make good progress without digging into that.
>
> I think that 3. is a no-brainer if it provides a really
> significant win, but without 1. we can't really measure its effect
> in isolation. It also has a usability problem since it requires
> feeding in an existing profile for the *instrumented* build, but
> if the benefit is very significant this may be worth it for some
> users. We will probably be able to easily refactor 1. as needed
> into an MST approach that degrades gracefully to using static
> heuristics in the absence of real profile information, so it is not a
> maintenance burden (it may even help by providing a good framework
> in which to develop effective static heuristics).
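For reference, the spanning-tree counter-placement idea in item 3 can be sketched as follows. This is the classic optimal edge-profiling result (due to Knuth; used by gcov-style instrumentation): place counters only on edges *outside* a maximum-weight spanning tree of the CFG, since tree-edge counts are then recoverable by flow conservation. The CFG, edge weights, and function names below are made up for illustration.

```python
# Maximum-weight spanning tree via Kruskal (heaviest edges first).
# Edges are (weight, src, dst); weights come from static heuristics or,
# in the MST-with-feedback proposal, from a prior profile.
def max_spanning_tree(nodes, edges):
    parent = {n: n for n in nodes}

    def find(n):  # union-find with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components: keep it in the tree
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Toy diamond CFG: hot path entry->B1->exit, cold path entry->B2->exit.
nodes = ["entry", "B1", "B2", "exit"]
edges = [(90, "entry", "B1"), (10, "entry", "B2"),
         (90, "B1", "exit"), (10, "B2", "exit")]

tree = max_spanning_tree(nodes, edges)
# Only off-tree edges need counters; here that is a single cold edge
# instead of all four. Tree-edge counts follow from flow conservation.
instrumented = [e for e in edges if e not in tree]
print(instrumented)
```

The profile feedback matters because the heaviest edges end up on the tree, so the counters that do get emitted sit on the coldest paths.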
>
> For the time being, I think we can avoid discussion of 2. and 3.
> until we have more of 1. working. So I think it would be most
> productive if we focus this discussion on 1.
>
>
> Additionally, some of the overhead imposed by FE
> instrumentation is not really all that easy to get rid of.
> You end up duplicating functionality that is more naturally
> implemented in the middle end.
>
>
> Yeah, I was looking into a couple of other simple approaches and
> quickly found out that I was basically replicating much of the
> sort of logic that the inliner already has.
>
> -- Sean Silva
>
>
> I see the two approaches as complementary rather than mutually
> exclusive. One does not negate the other. Some of the
> optimizations we'd do in the FE may hurt coverage. Instead,
> by instrumenting in the middle end, you can focus exclusively
> on performance (coverage be damned).
>
>
> Diego.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://llvm.cs.uiuc.edu
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>