[llvm-dev] [RFC] Target-specific parametrization of function inliner

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Wed Apr 6 10:46:15 PDT 2016


On Fri, Apr 1, 2016 at 12:35 PM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> ------------------------------
>
> *From: *"Mehdi Amini via llvm-dev" <llvm-dev at lists.llvm.org>
> *To: *"Xinliang David Li" <davidxl at google.com>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>
> *Sent: *Friday, April 1, 2016 2:26:27 PM
> *Subject: *Re: [llvm-dev] [RFC] Target-specific parametrization of
> function inliner
>
>
> On Mar 10, 2016, at 10:34 AM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>
>
> On Thu, Mar 10, 2016 at 6:49 AM, Chandler Carruth <chandlerc at google.com>
> wrote:
>
>> IMO, the appropriate thing for TTI to inform the inliner about is how
>> costly the actual act of a "call" is likely to be. I would hope that this
>> would only be used on targets where there is some really dramatic overhead
>> of actually doing a function call such that the code size cost incurred by
>> inlining is completely dwarfed by the improvements. GPUs are one of the few
>> platforms that exhibit this kind of behavior, although I don't think
>> they're truly unique, just a common example.
>>
>> This isn't quite the same thing as the cost of the call instruction,
>> which has much more to do with the size. Instead, it has to do with the
>> expected consequences of actually leaving a call edge in the program.
>>
>
>
>>
>> To me, this pretty accurately reflects the TTI hook we have for
>> customizing loop unrolling, where the cost of having a cyclic CFG is
>> modeled to help indicate that on some targets (also GPUs) it is worth a
>> very large amount of code size growth to simplify the control flow in a
>> particular way.
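>>
>> As a strawman (hypothetical name, not the current TTI interface), such a
>> hook might look like:
>>
>>   // Hypothetical: the expected runtime penalty of leaving a call edge
>>   // in the program, beyond the size of the call instruction itself.
>>   // A GPU target would return a large value here.
>>   virtual unsigned getCallEdgeCost(const CallInst *Call) const;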
>>
>>
> From a 10,000-foot view, the LLVM inliner implements a size-based
> heuristic: if the inline instance's size*/cost after simplification via
> propagating the call context (actually the relative size -- the callsite
> cost is subtracted from it) is smaller than a threshold (adjusted from a
> base value), then the callsite is considered an inline candidate. In most
> cases the decision is made locally due to the bottom-up order (there are
> tweaks to bypass it). The size/cost is only remotely tied to the real
> runtime cost due to icache/itlb effects and serves as a proxy for it; in
> practice the size/threshold scheme is mainly used to model the runtime
> speedup vs. compile time/binary size tradeoff.
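>
> In simplified pseudo-C++ (not the actual InlineCost code; the helper names
> are made up), the core decision is roughly:
>
>   // Cost is the simplified size of the inline instance, with the
>   // callsite cost already subtracted.
>   int Cost = computeInlineCost(CS);                // hypothetical
>   int Threshold = adjustThreshold(BaseValue, CS);  // hypothetical
>   bool IsCandidate = Cost < Threshold;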
>
>
> Other than the call cost itself, I've been surprised that the TTI is not
> more involved when it comes to this tradeoff: instructions don't have the
> same cost depending on the platform (oh, this operation is not legal on
> this type and will be expanded into multiple instructions in SDAG, too
> bad...).
>
> I think doing this was intended, but we've not done it yet (as we did for
> the throughput model used for vectorization). I think we should, and I
> also think we should combine the cost models so that we have a single
> model that returns multiple kinds of costs (throughput, size, latency,
> etc.).
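>
> For instance, a combined query might return something like this (a sketch,
> not an existing interface):
>
>   struct InstructionCosts {
>     unsigned RecipThroughput; // what the vectorizer's model uses today
>     unsigned CodeSize;        // what a size budget (inliner/unroller) needs
>     unsigned Latency;         // for latency-sensitive decisions
>   };
>   InstructionCosts getInstructionCosts(const Instruction *I) const;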
>


Yes -- the time/speedup estimate should be independent of the size-increase
estimate.

David


>
>
>  -Hal
>
>
> --
> Mehdi
>
>
>
> Setting aside what we need longer term for the inliner, the GPU-specific
> problems can be addressed by:
> 1) if the call overhead is really large, defining a target-specific
> getCallCost and subtracting it from the initial Cost when analyzing a
> callsite (this will help boost all targets with high call costs);
> 2) if not, but GPU users can instead tolerate large code growth, adjusting
> the threshold -- perhaps via a user-level option, -finline-limit=? (See
> the sketch below.)
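>
> Roughly, in terms of the existing cost computation (names illustrative):
>
>   // (1) boost targets with a high call overhead;
>   // (2) let users/targets scale the threshold instead.
>   int Cost = InitialCost - TTI.getCallCost(Callee);   // option 1
>   int Threshold = BaseThreshold * InlineLimitFactor;  // option 2
>   bool IsCandidate = Cost < Threshold;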
>
> thanks,
>
> David
>
>
> * some target-dependent info may be used: TTI.getUserCost
>
>
>> Does that make sense to you, Hal? Based on that, it would really just be
>> a scaling factor on the inline heuristics. I'm unsure how to express this
>> construct more scientifically.
>>
>> -Chandler
>>
>> On Thu, Mar 10, 2016 at 3:42 PM Hal Finkel via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> ------------------------------
>>>
>>> > From: "Artem Belevich via llvm-dev" <llvm-dev at lists.llvm.org>
>>> > To: "llvm-dev" <llvm-dev at lists.llvm.org>
>>> > Sent: Tuesday, March 1, 2016 6:31:06 PM
>>> > Subject: [llvm-dev] [RFC] Target-specific parametrization of function
>>> inliner
>>> >
>>> > Hi,
>>> >
>>> >
>>> > I propose to make the function inliner's parameters adjustable for
>>> > specific targets.
>>> >
>>> > Currently the function inlining pass appears to be target-agnostic,
>>> > with the various constants for calculating call cost hardcoded. While
>>> > that works reasonably well for general-purpose CPUs, quirkier targets
>>> > like NVPTX would benefit from target-specific tuning.
>>> >
>>> >
>>> > Currently it appears that there are two things that need to be done:
>>> >
>>> >
>>> > * add inliner preferences to TargetTransformInfo in a way similar to
>>> > how we customize loop unrolling, and use them to provide the inliner
>>> > with target-specific thresholds and other parameters (see the strawman
>>> > sketch below).
>>> > * augment the inliner pass to use the existing TargetTransformInfo API
>>> > to figure out the cost of a particular call on a given target.
>>> > TargetTransformInfo already has getCallCost(), though nothing appears
>>> > to use it.
>>> >
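>>> > As a strawman, mirroring TargetTransformInfo::UnrollingPreferences
>>> > (all names below are illustrative, not an existing interface):
>>> >
>>> >   struct InliningPreferences {
>>> >     unsigned Threshold; // replaces the hardcoded base threshold
>>> >     unsigned CallCost;  // target-specific cost of an outlined call
>>> >   };
>>> >   void getInliningPreferences(InliningPreferences &IP) const;
>>> >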
>>> >
>>> > Comments? Concerns? Suggestions?
>>> >
>>>
>>> Hi Art,
>>>
>>> I've long thought that we should have a more principled way of doing
>>> inline profitability. There is obviously some cost to executing a function
>>> body, some call site overhead, and some cost reduction associated with any
>>> post-inlining simplifications. If inlining reduces the overall call site
>>> cost by more than some factor, say 1% (this should probably depend on the
>>> optimization level), then we should inline. With profiling information, we
>>> might even use global speedup instead of local speedup.
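>>>
>>> To make the 1% figure concrete: if a call currently costs C_call +
>>> C_body per invocation, and the inlined-and-simplified body costs
>>> C_inlined, we would inline when
>>> (C_call + C_body - C_inlined) / (C_call + C_body) > 0.01.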
>>>
>>> Whether we need a target customization of this threshold, or just a way
>>> for a target to supplement the final inlining decision, is unclear to
>>> me. It is also true that the result of a bunch of locally-optimal
>>> decisions might be far from the global optimum. Maybe the target has
>>> something to say about that?
>>>
>>> In short, I'm fine with what you're proposing, but to the extent
>>> possible, I want the numbers provided by the target to mean something.
>>> Replacing a global set of somewhat-arbitrary magic numbers with
>>> target-specific sets of somewhat-arbitrary magic numbers should be our
>>> last choice.
>>>
>>> Thanks again,
>>> Hal
>>>
>>>
>>> >
>>> > Thanks,
>>> > --
>>> >
>>> >
>>> > --Artem Belevich
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>
>
>
>
>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>