[LLVMdev] RFC: Callee speedup estimation in inline cost analysis

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Tue Aug 18 21:40:45 PDT 2015


----- Original Message ----- 

> From: "Easwaran Raman" <eraman at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-dev at lists.llvm.org
> Sent: Wednesday, August 5, 2015 5:08:34 PM
> Subject: Re: [LLVMdev] RFC: Callee speedup estimation in inline cost
> analysis

> On Tue, Aug 4, 2015 at 11:51 PM, Hal Finkel <hfinkel at anl.gov>
> wrote:
> ----- Original Message -----
> > From: "Easwaran Raman" < eraman at google.com >
> > To: "< llvmdev at cs.uiuc.edu > List" < llvmdev at cs.uiuc.edu >
> > Sent: Thursday, July 30, 2015 4:25:41 PM
> > Subject: [LLVMdev] RFC: Callee speedup estimation in inline cost
> > analysis
> >
> >
> >
> > TLDR - The proposal below is intended to allow inlining of larger
> > callees when such inlining is expected to reduce the dynamic
> > instruction count.
> >
> >
> >
> > Proposal
> > -------------
> >
> >
> > LLVM inlines a function if the size growth (in the given context) is
> > less than a threshold. The threshold is increased based on certain
> > characteristics of the called function (the inline keyword and the
> > fraction of vector instructions, for example). I propose using the
> > estimated speedup (the estimated reduction in dynamic instruction
> > count, to be precise) as another factor that controls the threshold.
> > This would allow larger functions to be inlined when doing so is
> > likely to reduce execution time.
> >
> >
> > The dynamic instruction count of (an uninlined) function F is
> > DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB))
> >
> >
> > * The above summation is over all basic blocks in F.
> > * Freq(BB) = BlockFrequency(BB)/BlockFrequency(Entry(F))
> >
> >
> >
> > This dynamic instruction count measurement doesn't distinguish
> > between a single-cycle instruction and a long-latency instruction.
> > Instead of using InstructionCount(BB), we could use
> > Sum_I(Weight(I)), where the summation is over all instructions I in
> > BB and Weight(I) represents the time it takes to execute
> > instruction I.
> >
> >
> > The dynamic instruction count of F inlined into a callsite C,
> > InlinedDI(F, C), can be similarly computed, taking into account the
> > instructions that are simplified due to inlining.
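
For concreteness, DI(F) as defined above could be computed roughly as
follows using BlockFrequencyInfo (a minimal sketch, not the prototype's
actual code; the helper name is hypothetical):

  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Function.h"

  // DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB)), with Freq(BB)
  // normalized so that the entry block has frequency 1.0.
  static double computeDynamicInstCount(const llvm::Function &F,
                                        const llvm::BlockFrequencyInfo &BFI) {
    const double EntryFreq =
        BFI.getBlockFreq(&F.getEntryBlock()).getFrequency();
    double DI = 0.0;
    for (const llvm::BasicBlock &BB : F) {
      double Freq = BFI.getBlockFreq(&BB).getFrequency() / EntryFreq;
      // BB.size() is InstructionCount(BB); a per-instruction Weight(I)
      // could be substituted here to model latency differences.
      DI += Freq * BB.size();
    }
    return DI;
  }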

> Are you computing this cost in the current fashion, or using some
> other mechanism? As I recall, we currently limit the number of
> instructions visited here to prevent the overall analysis from being
> quadratic. This seems somewhat at odds with a change specifically
> targeted at large callees.

> I'm not sure what limit you're referring to here. There are early
> exits from the analyzeCall method once the maximum possible
> threshold (the threshold if all bonuses are to be applied) is
> reached. This change will increase the maximum possible threshold
> and hence you end up processing more. AFAICT, the cost is linear in
> the number of instructions in the callee per callsite.

Right, and bounded by that threshold.


> > The estimated
> > speedup is
> >
> > Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F)
> >
> >
> > A speedup above a pre-determined threshold implies there is an
> > expected benefit in inlining the callee F, and hence a bonus may
> > be applied to the associated threshold at that callsite.
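
As a concrete illustration of the bonus logic (a minimal sketch with
hypothetical names; the 15% cutoff and 12X multiplier are the
prototype's tuning values quoted later in this thread):

  // Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F); a speedup at or
  // above the cutoff earns a multiplicative threshold bonus.
  static int adjustThresholdForSpeedup(int Threshold, double DI,
                                       double InlinedDI) {
    const double SpeedupCutoff = 0.15; // the prototype's 15% cutoff
    const int BonusMultiplier = 12;    // the prototype's aggressive 12X bonus
    double Speedup = (DI - InlinedDI) / DI;
    return Speedup >= SpeedupCutoff ? Threshold * BonusMultiplier
                                    : Threshold;
  }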
> >
> >
> > Details
> > ----------
> > This proposal depends on the new pass manager, which would allow
> > the inline cost analysis to see function-level analysis passes. The
> > dynamic instruction count of the outlined function can be provided
> > by an analysis pass. This dynamic instruction count and the block
> > frequency can be either updated (preferable, imo) or recomputed
> > after each inlining.
> >
> >
> >
> > I prototyped a version of this (the prototype augments the
> > BasicBlock and Function classes to store the block frequency and
> > function execution times) and did some initial evaluation with
> > clang and some internal benchmarks used at Google. Implementing it
> > as described above resulted in large size growth when the
> > parameters were chosen to deliver a performance improvement. Some
> > ways to control size growth include applying the heuristic only to
> > hot callsites, only to inline functions, and measuring the speedup
> > using both caller and callee time (instead of just the latter). In
> > any case, without profile feedback it is very likely that there is
> > going to be a size increase when this heuristic is triggered on
> > cold modules. With profile feedback, the absolute count of the
> > callsite can be used to limit this to only hot callsites.
> >
> >
> >
> > Alternative approach
> > --------------------------
> > An alternative approach is to set the threshold proportional to
> > the estimated speedup, instead of just having a low and a high
> > threshold (for callees whose estimated speedup is below and above
> > the cutoff, respectively). In my opinion such an approach adds
> > complexity to performance debugging and tuning. My guess is that
> > it is not going to bring a significant additional improvement over
> > a well-tuned simple approach, but it is perhaps worth exploring
> > later.
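
Such proportional scaling might look roughly like this (a minimal
sketch with hypothetical names; the 12X upper bound is borrowed from
the prototype's bonus):

  #include <algorithm>

  // Scale the threshold continuously with the estimated speedup
  // instead of applying a single cutoff, interpolating between 1X
  // and 12X as the speedup goes from 0% to 100%.
  static int scaledThreshold(int BaseThreshold, double Speedup) {
    const int MaxMultiplier = 12;
    double Clamped = std::max(0.0, std::min(1.0, Speedup));
    return static_cast<int>(
        BaseThreshold * (1.0 + (MaxMultiplier - 1) * Clamped));
  }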
> >
> >
> > Preliminary Results
> > ---------------------------
> > With the caveat that the prototype implementation is not well
> > validated and very little tuning has been done, I have some
> > preliminary numbers. These were obtained using a very aggressive
> > 12X bonus for an estimated speedup of 15% or more.
> >
> >
> > * Clang (run with -O2 on a large preprocessed file): 1% performance
> > improvement and a 4.6% text size increase.
> > * Google benchmark suite (geomean of ~20 benchmarks): 5.5%
> > performance improvement and a 7.7% text size increase.
> > * Spec (all C and C++ benchmarks): 0.6% speedup and a 2% text size
> > increase.
> > * Chrome: performance neutral, 3.8% text size increase.

> What was the compile-time effect?

> Good question :) I just measured a build of clang with this change
> ('time ninja -j1 clang'). This increases *build* time by 13%, which
> is certainly an issue that needs to be addressed. Assuming the time
> spent on non-compile actions does not change, the actual compile
> time increase is likely a bit more than 13%.

> A few points regarding this:

> * Independent of this change, we could cache the results of the
> inline cost computation. One simple approach is to cache when none
> of the arguments at the callsite evaluates to a constant. The
> effect of such savings is likely to be greater with this change.
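
That caching scheme might look roughly like this (a minimal sketch
against the CallSite API of that era; all names are hypothetical, and
the actual cost computation is elided):

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/IR/CallSite.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Function.h"

  // Cache of context-independent inline costs, keyed by callee. Only
  // callsites with no constant arguments are eligible, since constants
  // can enable callee simplification and change the cost.
  static llvm::DenseMap<const llvm::Function *, int> InlineCostCache;

  static bool hasConstantArgument(llvm::ImmutableCallSite CS) {
    for (auto AI = CS.arg_begin(), AE = CS.arg_end(); AI != AE; ++AI)
      if (llvm::isa<llvm::Constant>(*AI))
        return true;
    return false;
  }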

> * The 12X threshold bonus needs to be tuned. I chose this factor to
> enable one beneficial inline in an internal benchmark I used.
> * I want to re-emphasize that this is a quick prototype and I am
> sure there are many inefficiencies in it.

> * There may be opportunities to further prune the set of callsites
> where this heuristic is applied.

Sounds good.

> >
> >
> > I propose to implement this change guarded by a flag. This flag
> > could be turned on at O2 with profile-guided optimization. If the
> > size increase under -O3 is not a huge concern, this could be
> > turned on at -O3 even without PGO.
> >

> For my users, such an increase would generally be acceptable;
> however, others might have a different opinion. I'd certainly
> support a -O4 level with more-aggressive inlining in that case.

> I would like to hear from other users of -O3 what an acceptable
> performance/code size tradeoff is for them. I measured the effects
> of -O3 on our internal benchmark suite. The geomean performance
> improved (compared to -O2, both without this proposed change) by
> 0.6% at a cost of a 4.5% size increase. If this is representative
> of what others see, I would argue that it is reasonable to enable
> this proposal at O3.

I agree.

 -Hal

> - Easwaran

> -Hal

> >
> > Looking forward to your feedback.
> >
> >
> > Thanks,
> > Easwaran


> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory


-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

