[PATCH] Inliner Enhancement

Thu Mar 19 11:49:55 PDT 2015

On Wed, Mar 18, 2015 at 7:01 PM, Jiangning Liu <liujiangning1 at gmail.com>
wrote:

> Hi David,
>
> Thanks for your feedback...
>
> 2015-03-18 14:25 GMT+08:00 Xinliang David Li <xinliangli at gmail.com>:
>
>> +Easwaran who is working on improving LLVM inliner with a more
>> sophisticated cost-model.
>>
>
> I'm not sure whether a more sophisticated cost model can work well for
> inlining. The current inlining cost models code size impact only, so I
> assume you are talking about a cost model for performance. The inliner runs
> at a very early stage of LLVM compilation, and it's really hard to estimate
> the performance impact accurately. For example, you would have to introduce
> a register cost model to account for register spill overhead, but I
> personally think it's hard to be accurate at such an early stage of LLVM
> compilation. Also, if we want it to be accurate, we would have to pay a
> compile-time cost. I think compile time is one of the advantages of LLVM
> over GCC that we don't want to lose.
>
>
Yes, it is hard (or impossible) to have an accurate cost model at the time
of inlining, but a model with few false positives can be used. For
instance, by computing a weighted instruction count (the weight being the
normalized block frequency) before and after inlining, we can estimate how
effective inlining is likely to be and use that estimate to give a callsite
a larger size threshold. Since we visit the instructions of a callee at
every callsite anyway, the incremental compile time for computing the
weighted instruction count will be small.
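
A minimal sketch of the weighted instruction count idea, under stated
assumptions: BlockSummary, weightedInstructionCount, adjustedThreshold and
the MaxBonusFactor parameter are hypothetical names for illustration only and
are not part of LLVM or of the patch. The bonus policy (scaling the threshold
in proportion to the estimated savings) is likewise just one possible choice.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a basic block: its instruction count and a block
// frequency normalized so that the function entry block has weight 1.0.
struct BlockSummary {
  unsigned NumInstructions;
  double NormalizedFrequency;
};

// Sum of per-block instruction counts, each scaled by its block frequency.
double weightedInstructionCount(const std::vector<BlockSummary> &Blocks) {
  double Total = 0.0;
  for (const BlockSummary &BB : Blocks) {
    assert(BB.NormalizedFrequency >= 0.0 && "frequency must be non-negative");
    Total += BB.NumInstructions * BB.NormalizedFrequency;
  }
  return Total;
}

// Illustrative policy (an assumption, not taken from the thread): grow the
// base threshold toward BaseThreshold * MaxBonusFactor in proportion to the
// fraction of weighted instructions that inlining is estimated to remove.
int adjustedThreshold(int BaseThreshold, double WeightedBefore,
                      double WeightedAfter, double MaxBonusFactor) {
  if (WeightedBefore <= 0.0 || WeightedAfter >= WeightedBefore)
    return BaseThreshold; // No estimated benefit: keep the base threshold.
  double Savings = (WeightedBefore - WeightedAfter) / WeightedBefore;
  return static_cast<int>(BaseThreshold *
                          (1.0 + Savings * (MaxBonusFactor - 1.0)));
}
```

With these definitions, a callsite whose weighted count is estimated to drop
from 40 to 20 would get the base threshold scaled halfway toward the maximum
bonus, while a callsite with no estimated savings keeps the base threshold.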

- Easwaran

>> The cases mentioned in your patch will be covered. Profile (including
>> static profile) data will also be used in the analysis.
>>
>
> My patch doesn't intend to cover PGO.
>
>
>>
>> On Tue, Mar 17, 2015 at 10:46 PM, Jiangning Liu <liujiangning1 at gmail.com>
>> wrote:
>>
>>> Hi chandlerc, apazos, yinma, hfinkel,
>>>
>>> Following the discussion in the BoF session at the 2014 LLVM dev meeting, I
>>> did some experiments to enhance the LLVM inliner and want to share my
>>> results so far. My major goal is to improve -O3 performance without
>>> profiling support, which should be the simplest scenario for using compiler
>>> optimization.
>>>
>>> Inlining more code usually increases performance at the cost of code size
>>> bloat, but inlining too aggressively can increase register pressure and
>>> hurt performance, e.g. more loop invariants are detected and hoisted out
>>> of the loop, and register pressure ends up increasing a lot. Meanwhile,
>>> inlining is expensive because we have to analyze every function at every
>>> call site with different arguments in order to remove as much dead code
>>> as possible. Therefore, the biggest challenge of the inlining problem is
>>> how to trade off performance improvement, code size bloat and compiler
>>> slowdown in a smart manner.
>>>
>>> 1. Design
>>>
>>> My design to address the issues described above is listed below.
>>>
>>> (1) For performance, the main idea is to enlarge the inlining threshold
>>> heuristically for *hot* spots detected at compile time. Code with the
>>> following properties is usually *hot*:
>>> (1.a) Callee inside a loop. If the callee can be inlined into a loop, we
>>> could probably expose more optimization opportunities, e.g. loop-invariant
>>> hoisting. This is particularly useful for small loops, such as those with
>>> fewer than 2~3 BasicBlocks, because such a simple loop structure is less
>>> likely to trigger register pressure issues.
>>> (1.b) Callee with a constant argument. For example, if the constant
>>> argument is used as a loop bound, it could trigger completely different
>>> loop unrolling behavior, like full unroll or partial unroll.
>>>
>>> Solution (1.a) requires loop info. With the current pass manager behavior,
>>> CallGraphSCCPass doesn’t allow using getAnalysisUsage to obtain loop info,
>>> but we can define a lightweight LoopAnalyzer pass inside module
>>> SimpleInliner, and this pass can be implemented simply by calling
>>> LoopInfoBase and DominatorTree.
>>>
>>
>> Chandler's new pass manager is designed to handle this.
>>
>
>>
>>>
>>> (2) For code size, we have two solutions:
>>> (2.a) It doesn't make sense to inline a lot of *cold* code. Since non-hot
>>> code can be treated as cold code, we can reduce the normal threshold. In
>>> the patch the default threshold for -O3 is changed from 275 to 240. This
>>> way, we could save a lot of code size. The performance loss caused by
>>> reducing the default threshold could be compensated by increasing the
>>> threshold for *hot* code inside loops.
>>> (2.b) It would be quite abnormal for a function to call the same callee
>>> many times, even with different arguments, because this kind of code can
>>> easily be refactored into a loop. So we can avoid inlining the same callee
>>> many times if we find this case.
>>>
>>
>> This simple heuristic is not always valid. For instance, the '[ ]'
>> operator for a container can be invoked many times with different
>> arguments. Inlining them can potentially expose CSE opportunities across
>> inline instances of the same callee.
>>
>
> Agreed, but I never said it wouldn't hurt performance in some cases; I
> just want to cover the most reasonable scenarios. If this is the case, I
> would hope the programmer changes the code to use a loop.
>
>
>>
>>
>>>
>>> (3) Compile time is a big challenge, because loop info calculation is
>>> really expensive.
>>> (3.a) Don’t re-compute loop info every time a callee is inlined, but only
>>> once we start to check the new callees introduced by inlining a callee.
>>> For example, A->B->C, and A->D->E. When analyzing caller A, if we decide
>>> to inline B into A, C will be exposed to A, and at this moment, we don’t
>>> re-compute loop info until checking A->D is completed, because the loop
>>> info about D won’t be affected by inlining B.
>>> (3.b) Solve the A->B->C dilemma differently using early exit. For example,
>>> take call graphs A->B->C and A->B->D. When analyzing caller B, if A->B->C
>>> passes the ABC check, i.e. C can be inlined into B, and (B+C) can be
>>> inlined into A as well, the current algorithm will defer it until
>>> analyzing caller A. But if we get D inlined into B before checking caller
>>> A, the code size of B could increase, and B may finally fail to be inlined
>>> into A. (Hal has explained this problem previously using the vector
>>> push_back case.) It means A->B->C will eventually be kept as it is,
>>> although D is inlined into B. This is *not* a problem, but a heuristic
>>> choice, I think. In a lot of C++ programs, there are many small functions
>>> that could trigger this ABC issue. But choosing B->D rather than A->B->C
>>> would hurt compile time, because it will check all of the callees inside
>>> B, although the ABC case is already detected. So we can exit early as soon
>>> as a positive ABC case is detected, and then the new algorithm will inline
>>> B into A first, at the moment of analyzing caller A. And then C and D
>>> could both be inlined into A eventually.
>>>
>>
>> For IPA, the loop info/loop tree representation can be trimmed to be much
>> leaner. Also it should support incremental update.
>>
>
> I'm not clear whether the IPA you are talking about crosses modules or not.
> Can you clarify?
>
>
>> David
>>
>>
>>
>>
>>>
>>> In order to apply methods (2.b) and (3.a), we have to solve an inline
>>> analysis ordering issue. The current inliner analyzes call sites in an
>>> unstable order. For example, take A->B1->C1, A->B2->C2, and A->B3->C3. The
>>> call site analysis order when analyzing caller A was B1, C1, B3, C3, B2,
>>> C2. Now I change the order to be B1, B2, B3, C1, C2, C3.
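
The B1, B2, B3, C1, C2, C3 order quoted above is what a first-in, first-out
worklist would produce. The following is only a rough sketch of that idea and
not the patch's code: CallMap and breadthFirstCallSiteOrder are hypothetical
names, call sites are reduced to callee names, every call site is assumed to
be inlined, and the call graph is assumed to be acyclic.

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical call graph: function name -> callees in call-site order.
using CallMap = std::map<std::string, std::vector<std::string>>;

// Returns the order in which call sites would be analyzed for Caller when
// newly exposed call sites are appended to the back of a FIFO worklist.
// Assumes an acyclic call graph and that every visited call site is inlined.
std::vector<std::string> breadthFirstCallSiteOrder(const CallMap &Calls,
                                                   const std::string &Caller) {
  std::vector<std::string> Order;
  std::queue<std::string> Worklist;

  auto CallerIt = Calls.find(Caller);
  if (CallerIt == Calls.end())
    return Order; // Caller has no call sites.
  for (const std::string &Callee : CallerIt->second)
    Worklist.push(Callee);

  while (!Worklist.empty()) {
    std::string Callee = Worklist.front();
    Worklist.pop();
    Order.push_back(Callee);

    // Inlining Callee exposes its own call sites to Caller; they are queued
    // behind all call sites that were already pending.
    auto CalleeIt = Calls.find(Callee);
    if (CalleeIt != Calls.end())
      for (const std::string &Exposed : CalleeIt->second)
        Worklist.push(Exposed);
  }
  return Order;
}
```

Because all of a caller's original call sites are visited before any exposed
one, repeated calls to the same callee can be seen together, and loop info
only needs refreshing when the analysis moves on to the exposed call sites.
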
>>>
>>> 2. Benchmark
>>>
>>> Chandler previously mentioned that the SPEC benchmarks are not good
>>> candidates for measuring code size impact, so I use the llvm bootstrap and
>>> chromium as the benchmarks for compile time and code size.
>>>
>>> On llvm revision r232011 (March 12), I got the following benchmark data:
>>>
>>> 1) Performance:
>>> SPEC 2000 geomean for AArch64: +1.24%
>>> SPEC 2006 geomean for AArch64: +0.3%
>>> 2) Code size:
>>> * SPEC 2000+2006: +2.68%
>>> * clang/llvm: +2.88%
>>> * Chromium: +2%~3%
>>> 3) Compile-time:
>>> * llvm bootstrap on x86: +1.8%
>>> * SPEC2006 build on x86: +2.7%
>>>
>>> Thanks,
>>> -Jiangning
>>>
>>> REPOSITORY
>>>   rL LLVM
>>>
>>> http://reviews.llvm.org/D8408
>>>
>>> Files:
>>>   include/llvm/Transforms/IPO/InlinerPass.h
>>>   lib/Transforms/IPO/InlineSimple.cpp
>>>   lib/Transforms/IPO/Inliner.cpp
>>>   test/Transforms/Inline/inline-loop.ll
>>>   test/Transforms/Inline/inline-misc.ll
>>>