[LLVMdev] RFC: Callee speedup estimation in inline cost analysis

Easwaran Raman eraman at google.com
Thu Jul 30 14:25:41 PDT 2015


TLDR - The proposal below is intended to allow inlining of larger callees
when such inlining is expected to reduce the dynamic instruction count.

Proposal
-------------

 LLVM inlines a function if the size growth (in the given context) is less
than a threshold. The threshold is increased based on certain
characteristics of the called function (inline keyword and the fraction of
vector instructions, for example). I propose using the estimated speedup
(more precisely, the estimated reduction in dynamic instruction count) as
another factor that controls the threshold. This would allow larger
functions whose inlining potentially reduces execution time to be inlined.

The dynamic instruction count of (an uninlined) function F is
DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB))

* The above summation is over all basic blocks in F.
* Freq(BB) = BlockFrequency(BB)/BlockFrequency(Entry(F))
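The definition above can be sketched in a few lines. This is an illustrative toy calculation, not LLVM's actual API; the block frequencies and instruction counts are made-up numbers.

```python
def dynamic_instruction_count(blocks, entry_freq):
    """DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB)), where Freq(BB)
    is the raw block frequency normalized by the entry block's frequency."""
    return sum((freq / entry_freq) * icount for freq, icount in blocks)

# A function F with an entry block (raw frequency 16, 5 instructions)
# and a loop body executed ~8x per entry (raw frequency 128, 3 instructions):
blocks = [(16, 5), (128, 3)]
di = dynamic_instruction_count(blocks, entry_freq=16)
# di == 1*5 + 8*3 == 29
```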

This dynamic instruction count measurement doesn't distinguish between a
single-cycle instruction and a long-latency instruction. Instead of
using InstructionCount(BB), we could use Sum_I(Weight(I)), where the
summation is over all instructions I in BB and Weight(I) represents the
time it takes to execute instruction I.
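The latency-weighted variant replaces a block's raw instruction count with a sum of per-instruction weights. A minimal sketch, with made-up illustrative latencies (the weight table is an assumption, not real target data):

```python
# Hypothetical per-opcode latency weights, purely for illustration.
WEIGHTS = {"add": 1, "mul": 3, "load": 4, "div": 20}

def weighted_block_cost(instructions):
    """Sum_I(Weight(I)) over the instructions I of one basic block."""
    return sum(WEIGHTS[op] for op in instructions)

# weighted_block_cost(["add", "mul", "load"]) == 1 + 3 + 4 == 8
```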

The dynamic instruction count of F after inlining into a callsite C,
InlinedDI(F, C), can be similarly computed, taking into account the
instructions that are simplified away by inlining. The estimated speedup is
Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F)
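As a quick worked instance of the formula (the counts are illustrative, continuing the toy numbers rather than real measurements):

```python
def speedup(di_f, inlined_di):
    """Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F)."""
    return (di_f - inlined_di) / di_f

# If inlining F at callsite C simplifies away enough instructions to
# drop the estimated dynamic count from 29 to 22:
s = speedup(29, 22)
# s == 7/29, i.e. roughly a 24% estimated reduction
```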

A speedup above a predetermined cutoff implies there is an expected
benefit in inlining the callee F, and hence a bonus may be applied to the
inline threshold at that callsite.
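The policy above is a step function: full bonus past the cutoff, none below it. A minimal sketch, assuming the 15% cutoff and 12x bonus quoted in the preliminary results (both are tuning parameters, not fixed parts of the proposal):

```python
SPEEDUP_CUTOFF = 0.15  # assumed cutoff; a tunable parameter
SPEEDUP_BONUS = 12     # assumed multiplicative bonus; also tunable

def adjusted_threshold(base_threshold, estimated_speedup):
    """Apply the bonus only when the estimated speedup clears the cutoff."""
    if estimated_speedup >= SPEEDUP_CUTOFF:
        return base_threshold * SPEEDUP_BONUS
    return base_threshold

# With a base threshold of 225: a 24% estimated speedup yields 2700,
# while a 10% estimated speedup leaves the threshold at 225.
```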

Details
----------
This proposal depends on the new pass manager, which will allow the inline
cost analysis to see function-level analysis passes. The dynamic
instruction count of the outlined (uninlined) function can be provided by
an analysis pass. This dynamic instruction count and the block frequency
can be either updated (preferable, imo) or recomputed after each inlining.

I prototyped a version of this (the prototype augments the BasicBlock and
Function classes to store the block frequency and function execution times)
and did some initial evaluation with clang and some internal benchmarks
used at Google. Implementing it as described above resulted in large size
growth when the parameters were chosen to deliver performance improvement.
Some ways to control size growth include applying the heuristic only for
hot callsites, only for inline functions, and measuring the speedup using
both caller and callee time (instead of just the latter). In any case,
without profile feedback it is very likely that there is going to be a size
increase when this heuristic is triggered on cold modules. With profile
feedback, the absolute count of the callsite can be used to limit this to
only hot callsites.
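With profile feedback, the hot-callsite gating mentioned above could be sketched as a simple predicate on the callsite's absolute profile count. The hotness ratio here is an illustrative assumption, not a tuned value:

```python
def is_hot_callsite(callsite_count, max_profile_count, hotness_ratio=0.01):
    """Treat a callsite as hot if its absolute profile count is at least
    a fixed fraction of the hottest count in the profile."""
    return callsite_count >= hotness_ratio * max_profile_count

# With a hottest count of 100000: a callsite executed 5000 times is
# hot, while one executed 100 times is not.
```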

Alternative approach
--------------------------
An alternative approach is to set the threshold proportional to the
estimated speedup, instead of just having a low and a high threshold (for
callees whose estimated speedup is below or above the cutoff,
respectively). In my opinion such an approach adds complexity in
performance debugging and tuning. My guess is this is not going to bring
significant additional improvement over a well-tuned simple approach, but
perhaps worth exploring later.
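The contrast between the two policies can be sketched side by side. All constants here (cutoff, bonus, scale factor) are illustrative assumptions:

```python
def step_threshold(base, est_speedup, cutoff=0.15, bonus=12):
    """The simple approach: a high threshold past the cutoff, low below it."""
    return base * bonus if est_speedup >= cutoff else base

def proportional_threshold(base, est_speedup, scale=80):
    """The alternative: threshold grows linearly with estimated speedup.
    The scale factor is a made-up tuning parameter."""
    return base * (1 + scale * est_speedup)

# At a 20% estimated speedup and base 225: the step policy gives 2700,
# while the proportional policy gives 225 * (1 + 80*0.2) == 3825.
```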

Preliminary Results
---------------------------
With the caveat that the prototype implementation is not well validated and
very little tuning has been done, I have some preliminary numbers. These
were obtained by using a very aggressive 12X bonus for estimated speedup of
15% or more.

* Clang (run with -O2 on a large preprocessed file): 1% performance
improvement and a 4.6% text size increase.
* Google benchmark suite (geomean of ~20 benchmarks): 5.5% performance
improvement and a 7.7% text size increase.
* Spec (all C and C++ benchmarks): 0.6% speedup and a 2% text size increase.
* Chrome: performance neutral, 3.8% text size increase.

I propose to implement this change guarded by a flag. This flag could be
turned on at -O2 with profile-guided optimization. If the size increase
under -O3 is not a huge concern, this could be turned on at -O3 even
without PGO.

Looking forward to your feedback.

Thanks,
Easwaran