[PATCH] Inliner Enhancement
Xinliang David Li
xinliangli at gmail.com
Tue Mar 17 23:25:16 PDT 2015
+Easwaran who is working on improving LLVM inliner with a more
sophisticated cost-model. The cases mentioned in your patch will be
covered. Profile (including static profile) data will also be used in the
On Tue, Mar 17, 2015 at 10:46 PM, Jiangning Liu <liujiangning1 at gmail.com>
> Hi chandlerc, apazos, yinma, hfinkel,
> Following the discussion in BOF session of LLVM dev meeting 2014, I did
> some experiments to enhance LLVM inliner and want to share my result at the
> moment. My major goal is to improve -O3 performance without profiling
> support, which should be the simplest scenario of using compiler
> Inlining more code usually could increase performance at the cost of code
> size bloat, but overly inlining code could increase register pressure and
> hurt performance, e.g. some more loop invariants can be detected and
> hoisted out of loop, and finally register pressure increases a lot. In the
> meantime, inline is expensive because we have to analyze every function in
> terms of every call site with different arguments to remove dead code as
> possible as we could. Therefore, the biggest challenge of Inlining problem
> is how we can make trade-off among performance improvement, code size bloat
> and compiler slowdown in a smart manner.
> 1. Design
> My design to address the issues described above is listed as below.
> (1) For performance, the main idea is enlarging inlining threshold
> heuristically for *hot* spots detected at compile-time. The codes with the
> following properties are usually *hot*,
> (1.a) callee Inside a loop. If callee can be inlined into a loop, we could
> probably expose more optimization opportunities. E.g. loop invariant hoist.
> And this solution is particularly useful to small loops, like having less
> than 2~3 BasicBlocks, because such a simple loop structure would be less
> possible to trigger register pressure issue.
> (1.b) callee with constant argument. For example, if the constant argument
> is used as a loop boundary, it could trigger completely different loop
> unrolling behavior, like full unroll or partial unroll.
> Solution (1.a) requires loop info. With current pass manager behavior,
> CallGraphSCCPass doesn’t allow to use getAnalysisUsage to obtain loop info,
> but we can define a lightweight LoopAnalyzer pass inside module
> SimpleInliner, and this pass can be implemented simply by calling
> LoopInfoBase and DominatorTree.
Chandler's new pass manager is designed to handle this.
> (2) For code size, we have two solutions,
> (2.a) It doesn't make sense to inine a lot of *cold* code. Since non-hot
> code can be treated as cold code, we can reduce the normal threshold. In
> the patch the default threshold for -O3 is changed from 275 to 240. This
> way, we could save code size a lot. The performance reduction caused by
> reducing default threshold could be compensated by increasing threshold for
> *hot* code inside loops.
> (2.b) It would be quite abnormal if a function call the same callee many
> times, even if they use different arguments, because this kind of code can
> easily refactored by loop. So we can avoid inlining the same callee many
> times if we find this case.
This simple heuristic is not always valid. For instance, the '[ ]' operator
for a container can be invoked many times with different argument. Inlining
them can potentially expose CSE opportunities across inline instances of
the same callee.
> (3) For compile time, it’s a big challenge, because loop info calculation
> is really expensive.
> (3.a) Don’t re-compute loop info every time callee is inlined, but only do
> it once we start to check the new callees introduced by inlining a callee.
> For example, A->B->C, and A->D->E. When analyzing caller A, if we decide to
> inline B into A, C will be exposed to A, and at this moment, we don’t
> re-compute loop info until checking A->D is completed, because the loop
> info about D won’t be affected after inlining B.
> (3.b) Solve A->B->C dilemma differently using early exit. For example, for
> call graph A->B->C, and A->B->D. When analyzing caller B, if A->B->C pass
> the ABC checking, i.e. C can be inlined into B, and (B+C) can be inlined
> into A as well, current algorithm will defer it until analyzing caller A.
> But if we get D inlined into B before checking caller A, the code size of B
> could increase, and finally fails to be inlined into A. (Hal has explained
> this problem previously using vector push_back case). It means A->B->C will
> be kept as it is eventually, although D is inlined into B. This is *not* a
> problem, but a heuristic choice, I think. For a lot of cpp program, there
> are a lot of small functions could trigger this ABC issue. But choosing
> B->D rather than A->B->C would hurt compile time, because it will check all
> of callees inside B, although ABC case is already detected. So we can early
> exit as soon as positive ABC case is detected, and then the new algorithm
> will inline B into A first, at the moment of analyzing caller A. And then C
> and D could both be inlined into A eventually.
For IPA, the loop info/loop tree representation can be trimmed to be much
leaner. Also it should support incremental update.
> In order to apply methods (2.b) and (3.a), we have to solve an inline
> analysis ordering issue. Current inliner analyzes call sites in an unstable
> order. For example, A->B1->C1, A->B2->C2, and A->B3->C3. The call site
> analysis order of analyzing caller A was B1, C1, B3, C3, B2, C2. Now I
> change the order to be B1, B2, B3, C1, C2, C3.
> 2. Benchmark
> Chandler previously mentioned SPEC benchmark is not a good candidate for
> measuring code size impact, so I use llvm bootstrap and chromium as the
> benchmarks for compile time and code size.
> On llvm revision r232011 (March 12), I got the following benchmark data,
> 1) Performance:
> SPEC 2000 geomean for AArch64: +1.24%
> SPEC 2006 geomean for AArch64: +0.3%
> 2) Code size:
> * SPEC 2000+2006: +2.68%
> * clang/llvm: +2.88%
> * Chromium: +2%~3%
> 3) Compile-time:
> * llvm bootstrap on x86: +1.8%
> * SPEC2006 build on x86: +2.7%
> rL LLVM
> EMAIL PREFERENCES
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the llvm-commits