[llvm-dev] [RFC] Target-specific parametrization of function inliner

Fri Apr 1 12:20:12 PDT 2016

----- Original Message -----

> From: "Xinliang David Li" <davidxl at google.com>
> To: "Chandler Carruth" <chandlerc at google.com>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "Artem Belevich"
> <tra at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Thursday, March 10, 2016 12:34:07 PM
> Subject: Re: [llvm-dev] [RFC] Target-specific parametrization of
> function inliner

> On Thu, Mar 10, 2016 at 6:49 AM, Chandler Carruth <
> chandlerc at google.com > wrote:

> > IMO, the appropriate thing for TTI to inform the inliner about is
> > how
> > costly the actual act of a "call" is likely to be. I would hope
> > that
> > this would only be used on targets where there is some really
> > dramatic overhead of actually doing a function call such that the
> > code size cost incurred by inlining is completely dwarfed by the
> > improvements. GPUs are one of the few platforms that exhibit this
> > kind of behavior, although I don't think they're truly unique, just
> > a common example.
> 

> > This isn't quite the same thing as the cost of the call
> > instruction,
> > which has much more to do with the size. Instead, it has to do with
> > the expected consequences of actually leaving a call edge in the
> > program.
> 
> > To me, this pretty accurately reflects the TTI hook we have for
> > customizing loop unrolling where the cost of having a cyclic CFG is
> > modeled to help indicate that on some targets (also GPUs) it is
> > worth a very large amount of code size growth to simplify the
> > control flow in a particular way.
> 

> From 10,000 feet, the LLVM inliner implements a size-based heuristic:
> if the inline instance's size*/cost after simplification via
> propagating the call context (actually the relative size -- the
> callsite cost is subtracted from it) is smaller than a threshold
> (adjusted from a base value), then the callsite is considered an
> inline candidate. In most cases, the decision is made locally due to
> the bottom-up order (there are tweaks to bypass it). The size/cost
> can be remotely tied to, and serves as a proxy for, the real
> runtime cost due to icache/itlb effects, but it seems the
> size/threshold scheme is mainly used to model the runtime speedup vs
> compile time/binary size tradeoffs.

Yes, it kind of gets this, but sometimes not very well.

Part of the problem is that we try to make local decisions to control global code size. I understand why we do this, but we shouldn't. The local metric should purely be based on local speedup. That metric itself can be modulated by a global heuristic to control costs from itlb/icache misses, etc. - and there's not too much we can do here without profiling data with temporal correlation information. 

> Setting aside what we need longer term for the inliner, the GPU-specific
> problems can be addressed by
> 1) if the call overhead is really large, define a target-specific
> getCallCost and subtract it from the initial Cost when analyzing a
> callsite (this will help boost all targets with high call costs)
Yes, this makes sense. 
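
As a rough illustration of option 1, here is a minimal standalone sketch (not LLVM code). ToyTarget, its CallCost field, and shouldInline are hypothetical names invented for this example, and the numbers in the note below are arbitrary. The only thing taken from the discussion is the idea of subtracting a target-provided call cost from the callsite's cost before comparing it against the threshold.

```cpp
#include <cassert>

// Hypothetical stand-in for a target description; this is NOT the real
// TargetTransformInfo interface, just an illustration of the idea.
struct ToyTarget {
  int CallCost; // assumed cost of leaving a call edge in the program
};

// Returns true if the callsite would be an inline candidate under a
// size/threshold scheme where the target's call cost is subtracted first.
bool shouldInline(int SimplifiedBodyCost, int Threshold, const ToyTarget &T) {
  assert(T.CallCost >= 0 && "call cost is expected to be non-negative");
  // Relative cost: what inlining adds (the simplified body) minus what it
  // removes (the call overhead on this target).
  int Cost = SimplifiedBodyCost - T.CallCost;
  return Cost < Threshold;
}
```

With a threshold of 225, a callee whose simplified body costs 500 is rejected when the call cost is 25 (475 is not below 225) but accepted when the call cost is 2000, which is the effect a target with very expensive calls would want.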

> 2) if not, but instead GPU users can tolerate large code growth, then
> it is better to do this by adjusting the threshold -- perhaps have a
> user-level option -finline-limit=?
Providing a user option makes sense, but as I said in the other e-mail, we should phrase it in terms of something not in arbitrary units. I think that % speedup makes the most sense. As it turns out, inlining a large function will probably produce a low speedup (especially if we have partial inlining where we only inline the hot regions), so this ends up correlated with code size too. 
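
To make the "% speedup" phrasing concrete, here is a small sketch under stated assumptions. CallSiteEstimate, its fields, and both functions are hypothetical and are not part of any existing LLVM interface; costs are in whatever unit the cost model estimates (cycles, say). The speedup is taken as the fraction of the call site's current cost that inlining removes.

```cpp
#include <cassert>

// Hypothetical per-callsite estimates; every name here is an assumption
// made for illustration only.
struct CallSiteEstimate {
  double BodyCost;        // cost of executing the callee body as-is
  double CallOverhead;    // call/return and argument-passing overhead
  double SimplifySavings; // cost removed by post-inlining simplification
};

// Local speedup, in percent, expected from inlining this callsite.
double estimatedSpeedupPercent(const CallSiteEstimate &E) {
  double Before = E.BodyCost + E.CallOverhead;
  assert(Before > 0.0 && "call site must have positive cost");
  // After inlining, the call overhead disappears and the body shrinks by
  // whatever the simplifications save.
  double After = E.BodyCost - E.SimplifySavings;
  return 100.0 * (Before - After) / Before;
}

// Inline only if the estimated local speedup exceeds the given percentage.
bool shouldInlineBySpeedup(const CallSiteEstimate &E, double MinSpeedupPercent) {
  return estimatedSpeedupPercent(E) > MinSpeedupPercent;
}
```

For example, a 20-unit callee with 5 units of call overhead and 5 units of simplification savings gives a 40% local speedup, while a 1000-unit callee with the same overhead and no savings gives under 0.5%. That is the sense in which the metric ends up correlated with callee size.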

-Hal 

> thanks,

> David

> * some target dependent info may be used: TTI.getUserCost

> > Does that make sense to you Hal? Based on that, it would really
> > just
> > be a scaling factor of the inline heuristics. Unsure of how to more
> > scientifically express this construct.
> 

> > -Chandler
> 

> > On Thu, Mar 10, 2016 at 3:42 PM Hal Finkel via llvm-dev <
> > llvm-dev at lists.llvm.org > wrote:
> 

> > > ----- Original Message -----
> > > > From: "Artem Belevich via llvm-dev" < llvm-dev at lists.llvm.org >
> > > > To: "llvm-dev" < llvm-dev at lists.llvm.org >
> > > > Sent: Tuesday, March 1, 2016 6:31:06 PM
> > > > Subject: [llvm-dev] [RFC] Target-specific parametrization of
> > > > function inliner
> > > >
> > > > Hi,
> > > >
> > > > I propose to make function inliner parameters adjustable for
> > > > specific targets.
> > > >
> > > > Currently the function inlining pass appears to be target-agnostic,
> > > > with various constants for calculating call cost hardcoded. While it
> > > > works reasonably well for general purpose CPUs, some quirkier
> > > > targets like NVPTX would benefit from target-specific tuning.
> > > >
> > > > Currently it appears that there are two things that need to be
> > > > done:
> > > >
> > > > * add Inliner preferences to TargetTransformInfo in a way similar
> > > > to how we customize loop unrolling. Use it to provide the inliner
> > > > with target-specific thresholds and other parameters.
> > > > * augment the Inliner pass to use the existing TargetTransformInfo
> > > > API to figure out the cost of a particular call on a given target.
> > > > TargetTransformInfo already has getCallCost(), though it does
> > > > not look like anything uses it.
> > > >
> > > > Comments? Concerns? Suggestions?
> > > >

> > > Hi Art,

> > > I've long thought that we should have a more principled way of
> > > doing inline profitability. There is obviously some cost to
> > > executing a function body, some call site overhead, and some cost
> > > reduction associated with any post-inlining simplifications. If
> > > inlining reduces the overall call site cost by more than some
> > > factor, say 1% (this should probably depend on the optimization
> > > level), then we should inline. With profiling information, we might
> > > even use global speedup instead of local speedup.

> > > Whether we need a target customization of this threshold, or just
> > > a way for a target to supplement the fine inlining decision, is
> > > unclear to me. It is also true that the result of a bunch of
> > > locally-optimal decisions might be far from the global optimum.
> > > Maybe the target has something to say about that?

> > > In short, I'm fine with what you're proposing, but to the extent
> > > possible, I want the numbers provided by the target to mean
> > > something. Replacing a global set of somewhat-arbitrary magic
> > > numbers with target-specific sets of somewhat-arbitrary magic
> > > numbers should be our last choice.

> > > Thanks again,
> > > Hal

> > > >
> > 
> 
> > > > Thanks,
> > 
> 
> > > > --
> > 
> 
> > > >
> > 
> 
> > > >
> > 
> 
> > > > --Artem Belevich
> > 
> 
> > > > _______________________________________________
> > 
> 
> > > > LLVM Developers mailing list
> > 
> 
> > > > llvm-dev at lists.llvm.org
> > 
> 
> > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > 
> 
> > > >
> > 
> 

> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory

-- 

Hal Finkel 
Assistant Computational Scientist 
Leadership Computing Facility 
Argonne National Laboratory 