[llvm-dev] [LLVMdev] RFC: Callee speedup estimation in inline cost analysis

Tue Aug 4 23:51:26 PDT 2015

----- Original Message -----
> From: "Easwaran Raman" <eraman at google.com>
> To: "<llvmdev at cs.uiuc.edu> List" <llvmdev at cs.uiuc.edu>
> Sent: Thursday, July 30, 2015 4:25:41 PM
> Subject: [LLVMdev] RFC: Callee speedup estimation in inline cost analysis
> 
> 
> 
> TLDR - The proposal below is intended to allow inlining of larger
> callees when such inlining is expected to reduce the dynamic
> instruction count.
> 
> 
> 
> Proposal
> -------------
> 
> 
> LLVM inlines a function if the size growth (in the given context) is
> less than a threshold. The threshold is increased based on certain
> characteristics of the called function (inline keyword and the
> fraction of vector instructions, for example). I propose the use of
> estimated speedup (estimated reduction in dynamic instruction count
> to be precise) as another factor that controls the threshold. This would
> allow larger functions whose inlining potentially reduces execution
> time to be inlined.
> 
> 
> The dynamic instruction count of (an uninlined) function F is
> DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB))
> 
> 
> * The above summation is over all basic blocks in F.
> * Freq(BB) = BlockFrequency(BB)/BlockFrequency(Entry(F))
> 
> 
> 
> This dynamic instruction count measurement doesn't distinguish
> between a single-cycle instruction and a long latency instruction.
> Instead of using InstructionCount(BB), we could use
> Sum_I(Weight(I)) where the summation is over all instructions I in BB
> and Weight(I) represents the time it takes to execute instruction I.
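
A minimal sketch of the DI(F) sum defined above, written as plain Python rather than against LLVM's APIs. The `Block` class, its field names, and the example frequencies and weights are all hypothetical stand-ins chosen for illustration; in a real implementation the frequencies would come from block frequency analysis.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a basic block."""
    block_frequency: float     # raw BlockFrequency(BB)
    instruction_weights: list  # one weight per instruction (assumed cost model)


def dynamic_instruction_count(blocks, entry, weighted=False):
    """DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB)).

    Freq(BB) is the block frequency relative to the entry block. With
    weighted=True, InstructionCount(BB) is replaced by Sum_I(Weight(I)).
    """
    if entry.block_frequency <= 0:
        raise ValueError("entry block frequency must be positive")
    total = 0.0
    for bb in blocks:
        freq = bb.block_frequency / entry.block_frequency
        if weighted:
            cost = sum(bb.instruction_weights)
        else:
            cost = len(bb.instruction_weights)
        total += freq * cost
    return total


# Made-up function: entry block, a loop body that runs ~10x per call, exit.
entry = Block(block_frequency=8, instruction_weights=[1, 1, 1])
loop = Block(block_frequency=80, instruction_weights=[1, 4, 1, 1])
exit_block = Block(block_frequency=8, instruction_weights=[1])
blocks = [entry, loop, exit_block]

print(dynamic_instruction_count(blocks, entry))                 # 44.0
print(dynamic_instruction_count(blocks, entry, weighted=True))  # 74.0
```

In this made-up example the single long-latency instruction in the loop body moves the estimate from 44 to 74, which is the distinction the weighted form is meant to capture.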
> 
> 
> The dynamic instruction count of F after inlining into a callsite C,
> InlinedDI(F, C), can be similarly computed, taking into account the
> instructions that are simplified due to inlining.

Are you computing this cost in the current fashion, or using some other mechanism? As I recall, we currently limit the number of instructions visited here to prevent the overall analysis from being quadratic. This seems somewhat at odds with a change specifically targeted at large callees.

> The estimated
> speedup is
> 
> Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F)
> 
> 
> Speedup above a pre-determined threshold implies there is an expected
> benefit in inlining the callee F and hence a bonus may be applied to
> the associated threshold at that callsite.
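
To make the speedup formula and the two-level threshold concrete, here is a small Python sketch. The function names are hypothetical, the base threshold of 100 is an arbitrary illustrative value, and the defaults (15% cutoff, 12X bonus) are borrowed from the preliminary results reported in this RFC. Treating the "bonus" as a multiplier on the threshold is an assumption; the proposal does not spell out how it is applied.

```python
def estimated_speedup(di, inlined_di):
    """Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F)."""
    if di <= 0:
        raise ValueError("DI(F) must be positive")
    return (di - inlined_di) / di


def inline_threshold(base_threshold, speedup, cutoff=0.15, bonus_multiplier=12):
    """Two-level threshold: apply the bonus only at or above the cutoff.

    Assumption: the bonus multiplies the base threshold.
    """
    if speedup >= cutoff:
        return base_threshold * bonus_multiplier
    return base_threshold


# Made-up numbers: inlining simplifies away 11 of 44 dynamic instructions.
speedup = estimated_speedup(44.0, 33.0)
print(speedup)                         # 0.25
print(inline_threshold(100, speedup))  # 1200
print(inline_threshold(100, 0.05))     # 100
```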
> 
> 
> Details
> ----------
> This proposal is dependent on the new pass manager that would allow
> inline cost analysis to see function level analysis passes. The
> outlined function dynamic instruction count can be provided by an
> analysis pass. This dynamic instruction count and the block
> frequency can be either updated (preferable, imo) or recomputed
> after each inlining.
> 
> 
> 
> I prototyped a version of this (the prototype augments the BasicBlock
> and Function classes to store the block frequency and function
> execution times) and did some initial evaluation with clang and some
> internal benchmarks used at Google. Implementing it as described
> above resulted in a large size growth when the parameters are chosen
> to deliver performance improvement. Some ways to control size growth
> include applying the heuristic only for hot callsites, only for
> inline functions, and measuring the speedup using both caller and
> callee time (instead of just the latter). In any case, without
> profile feedback it is very likely that there is going to be a size
> increase when this heuristic is triggered on cold modules. With
> profile feedback, the absolute count of the callsite can be used to
> limit this to only hot callsites.
> 
> 
> 
> Alternative approach
> --------------------------
> An alternative approach is to set the thresholds proportional to the
> estimated speedup, instead of having just a low and a high threshold
> (for callees whose estimated speedup is below and above the cutoff,
> respectively). In my opinion such an approach adds complexity in
> performance debugging and tuning. My guess is this is not going to
> bring significant additional improvement over a well-tuned simple
> approach, but perhaps worth exploring later.
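
For comparison, a sketch of what a proportional threshold could look like. The linear mapping, the function name, and the 12X ceiling are assumptions made purely for illustration; the RFC does not specify how the threshold would scale with the estimated speedup.

```python
def proportional_threshold(base_threshold, speedup, max_multiplier=12.0):
    """Scale the threshold linearly with the estimated speedup.

    Assumed mapping: speedup 0 -> base threshold,
    speedup 1 -> max_multiplier * base threshold.
    """
    clamped = min(max(speedup, 0.0), 1.0)  # keep speedup within [0, 1]
    return base_threshold * (1.0 + (max_multiplier - 1.0) * clamped)


print(proportional_threshold(100, 0.0))  # 100.0
print(proportional_threshold(100, 0.5))  # 650.0
print(proportional_threshold(100, 1.0))  # 1200.0
```

Under a mapping like this every callsite gets its own threshold, which illustrates the tuning and performance-debugging concern raised above: there is no single cutoff to inspect when asking why a given call was or was not inlined.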
> 
> 
> Preliminary Results
> ---------------------------
> With the caveat that the prototype implementation is not well
> validated and very little tuning has been done, I have some
> preliminary numbers. These were obtained by using a very aggressive
> 12X bonus for estimated speedup of 15% or more.
> 
> 
> * Clang (run with -O2 on a large preprocessed file): 1% performance
> improvement and a 4.6% text size increase.
> * Google benchmark suite (geomean of ~20 benchmarks): 5.5%
> performance improvement and a 7.7% text size increase
> 
> * Spec (all C and C++ benchmarks): 0.6% speedup and 2% text size
> increase
> 
> * Chrome: Performance neutral, 3.8% text size increase

What was the compile-time effect?

> 
> 
> I propose to implement this change guarded by a flag. This flag
> could be turned on in -O2 with profile guided optimization. If size
> increase under -O3 is not a huge concern, this could be turned on in
> -O3 even without PGO.
> 

For my users, such an increase would generally be acceptable; however, others might have a different opinion. I'd certainly support a -O4 level with more-aggressive inlining in that case.

 -Hal

> 
> Looking forward to your feedback.
> 
> 
> Thanks,
> Easwaran
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory