[LLVMdev] RFC: Callee speedup estimation in inline cost analysis

Easwaran Raman eraman at google.com
Fri Jul 31 14:36:45 PDT 2015


On Fri, Jul 31, 2015 at 2:25 PM, Xinliang David Li <xinliangli at gmail.com>
wrote:

> Just nitpicking:
> 1) DI(F) should include a component that estimates the prologue/epilogue
> cost (frameSetupCost), which InlinedDI does not have
>
 The cost of frame setup is accounted for in the existing inline cost
computation. For each argument, it adds a negative cost since the
instructions setting it up will go away after inlining. Since I was
piggybacking on this (InlinedDI will be the actual dynamic instruction count
minus the dynamic instructions due to argument setup), I didn't mention this
explicitly, but yes, I agree that this has to be accounted for.

> 2) The speedup should include the callsite cost associated with 'C' (call
> instr, argument passing):
>    Speedup(F,C) = (DI(F) + CallCost(C) - InlinedDI(F,C))/DI(F).
>
Agreed.
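To make the corrected formula concrete, here is a minimal sketch (the function name and inputs are illustrative stand-ins, not LLVM's actual inliner interfaces):

```python
def speedup(di_f, call_cost, inlined_di):
    """Speedup(F, C) = (DI(F) + CallCost(C) - InlinedDI(F, C)) / DI(F).

    di_f:       dynamic instruction count of the out-of-line callee F
    call_cost:  cost of the call instruction and argument passing at C
    inlined_di: dynamic instruction count of F's body after inlining at C
    """
    return (di_f + call_cost - inlined_di) / di_f
```

For example, a callee with DI(F) = 100 whose inlined body simplifies to 80
dynamic instructions, at a callsite costing 5, gives an estimated speedup of
25%.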


> Otherwise the proposal looks reasonable to me.
>
> David
>
>
> On Thu, Jul 30, 2015 at 2:25 PM, Easwaran Raman <eraman at google.com> wrote:
>
>> TLDR - The proposal below is intended to allow inlining of larger callees
>> when such inlining is expected to reduce the dynamic instructions count.
>>
>> Proposal
>> -------------
>>
>>  LLVM inlines a function if the size growth (in the given context) is
>> less than a threshold. The threshold is increased based on certain
>> characteristics of the called function (inline keyword and the fraction of
>> vector instructions, for example). I propose the use of estimated speedup
>> (the estimated reduction in dynamic instruction count, to be precise) as
>> another factor that controls the threshold. This would allow larger functions
>> whose inlining potentially reduces execution time to be inlined.
>>
>> The dynamic instruction count of (an uninlined) function F is
>> DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB))
>>
>> * The above summation is over all basic blocks in F.
>> * Freq(BB) = BlockFrequency(BB)/BlockFrequency(Entry(F))
>>
>> This dynamic instruction count measurement doesn't distinguish between a
>> single-cycle instruction and a long-latency instruction. Instead of
>> using InstructionCount(BB), we could use Sum_I(Weight(I)), where the
>> summation is over all instructions I in BB and Weight(I) represents the
>> time it takes to execute instruction I.
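A minimal Python sketch of the DI(F) computation above (the dict-based inputs are illustrative stand-ins, not LLVM's BlockFrequencyInfo interface; using weight 1 for every instruction recovers the plain instruction count):

```python
def dynamic_inst_count(block_freqs, entry_freq, inst_weights):
    """DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB)), where Freq(BB) is the
    block's frequency normalized by the entry block's frequency.

    block_freqs:  dict mapping basic block name -> raw block frequency
    entry_freq:   raw frequency of F's entry block
    inst_weights: dict mapping basic block name -> list of per-instruction
                  weights (all 1s gives the unweighted instruction count)
    """
    total = 0.0
    for bb, freq in block_freqs.items():
        total += (freq / entry_freq) * sum(inst_weights[bb])
    return total
```

For instance, an entry block with 3 instructions and a loop body with 2
instructions that runs 4 times per entry yields DI = 1*3 + 4*2 = 11.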
>>
>> The dynamic instruction count of F after inlining into a callsite C,
>> InlinedDI(F, C), can be similarly computed, taking into account the
>> instructions that are simplified away due to inlining. The estimated
>> speedup is
>> Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F)
>>
>> Speedup above a pre-determined threshold implies there is an expected
>> benefit in inlining the callee F and hence a bonus may be applied to the
>> associated threshold at that callsite.
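A sketch of how such a bonus might be applied (the cutoff and factor here are the values used in the prototype experiments described later in this proposal; the function and constant names are illustrative, not actual LLVM code):

```python
SPEEDUP_CUTOFF = 0.15  # bonus applies at an estimated speedup of 15% or more
BONUS_FACTOR = 12      # the aggressive 12X bonus used in the prototype

def adjusted_threshold(base_threshold, estimated_speedup):
    """Apply a flat multiplicative bonus to the inline threshold when the
    estimated speedup at this callsite meets the cutoff; otherwise leave
    the threshold unchanged."""
    if estimated_speedup >= SPEEDUP_CUTOFF:
        return base_threshold * BONUS_FACTOR
    return base_threshold
```

With an illustrative base threshold of 225, a callsite with a 20% estimated
speedup would be evaluated against a threshold of 2700, while one with a 10%
estimate keeps the base threshold.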
>>
>> Details
>> ----------
>> This proposal depends on the new pass manager, which would allow the
>> inline cost analysis to see function-level analysis passes. The dynamic
>> instruction count of the out-of-line function can be provided by an
>> analysis pass. This count and the block frequencies can be either
>> updated (preferable, imo) or recomputed after each inlining.
>>
>> I prototyped a version of this (the prototype augments the BasicBlock and
>> Function classes to store the block frequency and function execution times)
>> and did some initial evaluation with clang and some internal benchmarks
>> used at Google. Implementing it as described above resulted in a large size
>> growth when the parameters were chosen to deliver performance improvement.
>> Some ways to control size growth include applying the heuristic only for
>> hot callsites, only for inline functions, and measuring the speedup using
>> both caller and callee time (instead of just the latter). In any case,
>> without profile feedback it is very likely that there is going to be a size
>> increase when this heuristic is triggered on cold modules. With profile
>> feedback, the absolute count of the callsite can be used to limit this to
>> only hot callsites.
>>
>> Alternative approach
>> --------------------------
>> An alternative approach is to set the thresholds proportional to the
>> estimated speedup, instead of just having a low and a high threshold (for
>> callees whose estimated speedup is below and above the cutoff,
>> respectively). In my opinion such an approach adds complexity in
>> performance debugging and tuning. My guess is this is not going to bring
>> significant additional improvement over a well-tuned simple approach, but
>> perhaps worth exploring later.
>>
>> Preliminary Results
>> ---------------------------
>> With the caveat that the prototype implementation is not well validated
>> and very little tuning has been done, I have some preliminary numbers.
>> These were obtained by using a very aggressive 12X bonus for estimated
>> speedup of 15% or more.
>>
>> * Clang (run with -O2 on a large preprocessed file): 1% performance
>> improvement and a 4.6% text size increase.
>> * Google benchmark suite (geomean of ~20 benchmarks): 5.5% performance
>> improvement and a 7.7% text size increase
>> * Spec (all C and C++ benchmarks): 0.6% speedup and 2% text size increase
>> * Chrome: Performance neutral, 3.8% text size increase
>>
>> I propose to implement this change guarded by a flag. This flag could
>> be turned on in O2 with profile guided optimization. If size increase under
>> -O3 is not a huge concern, this could be turned on in -O3 even without PGO.
>>
>> Looking forward to your feedback.
>>
>> Thanks,
>> Easwaran
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>