[LLVMdev] RFC: Callee speedup estimation in inline cost analysis

Xinliang David Li xinliangli at gmail.com
Fri Jul 31 14:25:38 PDT 2015


Just nitpicking:
1) DI(F) should include a component that estimates the prologue/epilogue
cost (frameSetupCost), which InlinedDI does not have
2) The speedup should include the callsite cost associated with 'C' (the
call instruction and argument passing):
   Speedup(F,C) = (DI(F) + CallCost(C) - InlinedDI(F,C))/DI(F).

Otherwise the proposal looks reasonable to me.

David


On Thu, Jul 30, 2015 at 2:25 PM, Easwaran Raman <eraman at google.com> wrote:

> TLDR - The proposal below is intended to allow inlining of larger callees
> when such inlining is expected to reduce the dynamic instruction count.
>
> Proposal
> -------------
>
>  LLVM inlines a function if the size growth (in the given context) is less
> than a threshold. The threshold is increased based on certain
> characteristics of the called function (inline keyword and the fraction of
> vector instructions, for example). I propose using the estimated speedup
> (more precisely, the estimated reduction in dynamic instruction count) as
> another factor that controls the threshold. This would allow larger functions
> whose inlining potentially reduces execution time to be inlined.
>
> The dynamic instruction count of (an uninlined) function F is
> DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB))
>
> * The above summation is over all basic blocks in F.
> * Freq(BB) = BlockFrequency(BB)/BlockFrequency(Entry(F))
>
> This dynamic instruction count measurement doesn't distinguish between a
> single-cycle instruction and a long-latency instruction. Instead of
> using InstructionCount(BB), we could use Sum_I(Weight(I)), where the
> summation is over all instructions I in BB and Weight(I) represents the
> time it takes to execute instruction I.
>
> The dynamic instruction count of F after inlining into a callsite C,
> InlinedDI(F, C), can be similarly computed, taking into account the
> instructions that are simplified away by inlining. The estimated speedup is
> Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F)
>
> Speedup above a pre-determined threshold implies there is an expected
> benefit in inlining the callee F and hence a bonus may be applied to the
> associated threshold at that callsite.
>
> Details
> ----------
> This proposal depends on the new pass manager, which would allow the
> inline cost analysis to use function-level analysis passes. The outlined
> function's dynamic instruction count can be provided by an analysis pass. This dynamic
> instruction count and the block frequency can be either updated
> (preferable, imo) or recomputed after each inlining.
>
> I prototyped a version of this (the prototype augments the BasicBlock and
> Function classes to store the block frequency and function execution times)
> and did some initial evaluation with clang and some internal benchmarks
> used at Google. Implementing it as described above resulted in a large size
> growth when the parameters were chosen to deliver performance improvements.
> Some ways to control size growth include applying the heuristic only for
> hot callsites, only for inline functions, and measuring the speedup using
> both caller and callee time (instead of just the latter). In any case,
> without profile feedback it is very likely that there is going to be a size
> increase when this heuristic is triggered on cold modules. With profile
> feedback, the absolute count of the callsite can be used to limit this to
> only hot callsites.
>
> Alternative approach
> --------------------------
> An alternative approach is to set the thresholds proportional to the
> estimated speedup, instead of just having low and high thresholds (for
> callees whose estimated speedup is below and above the cutoff,
> respectively). In my opinion such an approach adds complexity in
> performance debugging and tuning. My guess is this is not going to bring
> significant additional improvement over a well-tuned simple approach, but
> perhaps worth exploring later.
>
> Preliminary Results
> ---------------------------
> With the caveat that the prototype implementation is not well validated
> and very little tuning has been done, I have some preliminary numbers.
> These were obtained by using a very aggressive 12X bonus for estimated
> speedup of 15% or more.
>
> * Clang (run with -O2 on a large preprocessed file): 1% performance
> improvement and a 4.6% text size increase.
> * Google benchmark suite (geomean of ~20 benchmarks): 5.5% performance
> improvement and a 7.7% text size increase
> * Spec (all C and C++ benchmarks): 0.6% speedup and 2% text size increase
> * Chrome: Performance neutral, 3.8% text size increase
>
> I propose to implement this change guarded by a flag. This flag could
> be turned on in O2 with profile guided optimization. If size increase under
> -O3 is not a huge concern, this could be turned on in -O3 even without PGO.
>
> Looking forward to your feedback.
>
> Thanks,
> Easwaran
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>

