[PATCH] Inliner Enhancement
liujiangning1 at gmail.com
Wed Mar 18 19:01:43 PDT 2015
Thanks for your feedback...
2015-03-18 14:25 GMT+08:00 Xinliang David Li <xinliangli at gmail.com>:
> +Easwaran who is working on improving LLVM inliner with a more
> sophisticated cost-model.
I'm not sure a more sophisticated cost model can work well for inlining or
not. Current inlining cost is to model code size impact only, so I assume
you are talking about the cost model for performance. Inliner happens at
very early stage of LLVM compilation, and it's really hard to give accurate
performance cost impact. For example, you would have to introduce register
cost model to consider register spill overhead, but I personally think is
it's hard to be accurate at such an early stage of LLVM compilation. Also,
if we want to get it accurate, we would have to pay compile-time cost. I
think compile-time is one of the advantages of LLVM over GCC we don't want
> The cases mentioned in your patch will be covered. Profile (including
> static profile) data will also be used in the analysis.
My patch doesn't tend to cover PGO.
> On Tue, Mar 17, 2015 at 10:46 PM, Jiangning Liu <liujiangning1 at gmail.com>
>> Hi chandlerc, apazos, yinma, hfinkel,
>> Following the discussion in BOF session of LLVM dev meeting 2014, I did
>> some experiments to enhance LLVM inliner and want to share my result at the
>> moment. My major goal is to improve -O3 performance without profiling
>> support, which should be the simplest scenario of using compiler
>> Inlining more code usually could increase performance at the cost of code
>> size bloat, but overly inlining code could increase register pressure and
>> hurt performance, e.g. some more loop invariants can be detected and
>> hoisted out of loop, and finally register pressure increases a lot. In the
>> meantime, inline is expensive because we have to analyze every function in
>> terms of every call site with different arguments to remove dead code as
>> possible as we could. Therefore, the biggest challenge of Inlining problem
>> is how we can make trade-off among performance improvement, code size bloat
>> and compiler slowdown in a smart manner.
>> 1. Design
>> My design to address the issues described above is listed as below.
>> (1) For performance, the main idea is enlarging inlining threshold
>> heuristically for *hot* spots detected at compile-time. The codes with the
>> following properties are usually *hot*,
>> (1.a) callee Inside a loop. If callee can be inlined into a loop, we
>> could probably expose more optimization opportunities. E.g. loop invariant
>> hoist. And this solution is particularly useful to small loops, like having
>> less than 2~3 BasicBlocks, because such a simple loop structure would be
>> less possible to trigger register pressure issue.
>> (1.b) callee with constant argument. For example, if the constant
>> argument is used as a loop boundary, it could trigger completely different
>> loop unrolling behavior, like full unroll or partial unroll.
>> Solution (1.a) requires loop info. With current pass manager behavior,
>> CallGraphSCCPass doesn’t allow to use getAnalysisUsage to obtain loop info,
>> but we can define a lightweight LoopAnalyzer pass inside module
>> SimpleInliner, and this pass can be implemented simply by calling
>> LoopInfoBase and DominatorTree.
> Chandler's new pass manager is designed to handle this.
>> (2) For code size, we have two solutions,
>> (2.a) It doesn't make sense to inine a lot of *cold* code. Since non-hot
>> code can be treated as cold code, we can reduce the normal threshold. In
>> the patch the default threshold for -O3 is changed from 275 to 240. This
>> way, we could save code size a lot. The performance reduction caused by
>> reducing default threshold could be compensated by increasing threshold for
>> *hot* code inside loops.
>> (2.b) It would be quite abnormal if a function call the same callee many
>> times, even if they use different arguments, because this kind of code can
>> easily refactored by loop. So we can avoid inlining the same callee many
>> times if we find this case.
> This simple heuristic is not always valid. For instance, the '[ ]'
> operator for a container can be invoked many times with different argument.
> Inlining them can potentially expose CSE opportunities across inline
> instances of the same callee.
Agree, but I never say it wouldn't hurt performance for some cases, and I
just want to cover the most reasonable scenarios. If this is the case, I
was hoping programmer change it by using loop.
>> (3) For compile time, it’s a big challenge, because loop info calculation
>> is really expensive.
>> (3.a) Don’t re-compute loop info every time callee is inlined, but only
>> do it once we start to check the new callees introduced by inlining a
>> callee. For example, A->B->C, and A->D->E. When analyzing caller A, if we
>> decide to inline B into A, C will be exposed to A, and at this moment, we
>> don’t re-compute loop info until checking A->D is completed, because the
>> loop info about D won’t be affected after inlining B.
>> (3.b) Solve A->B->C dilemma differently using early exit. For example,
>> for call graph A->B->C, and A->B->D. When analyzing caller B, if A->B->C
>> pass the ABC checking, i.e. C can be inlined into B, and (B+C) can be
>> inlined into A as well, current algorithm will defer it until analyzing
>> caller A. But if we get D inlined into B before checking caller A, the code
>> size of B could increase, and finally fails to be inlined into A. (Hal has
>> explained this problem previously using vector push_back case). It means
>> A->B->C will be kept as it is eventually, although D is inlined into B.
>> This is *not* a problem, but a heuristic choice, I think. For a lot of cpp
>> program, there are a lot of small functions could trigger this ABC issue.
>> But choosing B->D rather than A->B->C would hurt compile time, because it
>> will check all of callees inside B, although ABC case is already detected.
>> So we can early exit as soon as positive ABC case is detected, and then the
>> new algorithm will inline B into A first, at the moment of analyzing caller
>> A. And then C and D could both be inlined into A eventually.
> For IPA, the loop info/loop tree representation can be trimmed to be much
> leaner. Also it should support incremental update.
I'm not clear the IPA you are talking about is crossing modules or not. Can
>> In order to apply methods (2.b) and (3.a), we have to solve an inline
>> analysis ordering issue. Current inliner analyzes call sites in an unstable
>> order. For example, A->B1->C1, A->B2->C2, and A->B3->C3. The call site
>> analysis order of analyzing caller A was B1, C1, B3, C3, B2, C2. Now I
>> change the order to be B1, B2, B3, C1, C2, C3.
>> 2. Benchmark
>> Chandler previously mentioned SPEC benchmark is not a good candidate for
>> measuring code size impact, so I use llvm bootstrap and chromium as the
>> benchmarks for compile time and code size.
>> On llvm revision r232011 (March 12), I got the following benchmark data,
>> 1) Performance:
>> SPEC 2000 geomean for AArch64: +1.24%
>> SPEC 2006 geomean for AArch64: +0.3%
>> 2) Code size:
>> * SPEC 2000+2006: +2.68%
>> * clang/llvm: +2.88%
>> * Chromium: +2%~3%
>> 3) Compile-time:
>> * llvm bootstrap on x86: +1.8%
>> * SPEC2006 build on x86: +2.7%
>> rL LLVM
>> EMAIL PREFERENCES
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the llvm-commits