[PATCHES] A module inliner pass with a greedy call site queue

Xinliang David Li xinliangli at gmail.com
Wed Aug 20 10:44:49 PDT 2014


On Wed, Aug 20, 2014 at 10:31 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
> > From: "Xinliang David Li" <xinliangli at gmail.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "LLVM Commits" <llvm-commits at cs.uiuc.edu>, "Jiangning Liu"
> > <Jiangning.Liu at arm.com>, "Nick Lewycky" <nicholas at mxc.ca>
> > Sent: Wednesday, August 20, 2014 10:49:15 AM
> > Subject: Re: [PATCHES] A module inliner pass with a greedy call site
> > queue
> >
> > On Wed, Aug 20, 2014 at 1:10 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >
> >
> >
> > ----- Original Message -----
> > > From: "Xinliang David Li" <xinliangli at gmail.com>
> > > To: "Hal Finkel" <hfinkel at anl.gov>
> > > Cc: "LLVM Commits" <llvm-commits at cs.uiuc.edu>, "Jiangning Liu"
> > > <Jiangning.Liu at arm.com>, "Nick Lewycky" <nicholas at mxc.ca>
> > > Sent: Tuesday, August 19, 2014 11:40:28 PM
> > > Subject: Re: [PATCHES] A module inliner pass with a greedy call site
> > > queue
> > >
> > > On Tue, Aug 19, 2014 at 3:09 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > >
> > > ----- Original Message -----
> > > > From: "Xinliang David Li" <xinliangli at gmail.com>
> > > > To: "Nick Lewycky" <nicholas at mxc.ca>
> > > > Cc: "LLVM Commits" <llvm-commits at cs.uiuc.edu>, "Jiangning Liu"
> > > > <Jiangning.Liu at arm.com>
> > > > Sent: Friday, August 8, 2014 3:18:55 AM
> > > > Subject: Re: [PATCHES] A module inliner pass with a greedy call
> > > > site queue
> > > >
> > > > "Global inliner" is the term I use for a priority-queue-based
> > > > inliner:
> > > >
> > > > 1) it does not define a particular inlining order;
> > > > 2) it can be modeled to implement a strict bottom-up or top-down
> > > > order;
> > > > 3) the analysis can be performed 'globally' on call chains instead
> > > > of just caller-callee pairs;
> > > > 4) it is not necessarily 'greedy'.
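
As a rough sketch of that shape (using hypothetical CallSite /
estimateBenefit / inlineAndGetNewCalls helpers, not LLVM's actual API):

// Sketch only: a greedy, priority-queue-driven inliner loop.
#include <queue>
#include <vector>

struct CallSite { /* caller, callee, ... */ };

struct Candidate {
  CallSite CS;
  double Benefit;  // estimated benefit/cost score for this call site
  bool operator<(const Candidate &O) const { return Benefit < O.Benefit; }
};

// Hypothetical analysis and transformation hooks (stubbed out here).
double estimateBenefit(const CallSite &) { return 1.0; }
std::vector<CallSite> inlineAndGetNewCalls(const CallSite &) { return {}; }

void runGlobalInliner(const std::vector<CallSite> &AllSites,
                      unsigned GrowthBudget) {
  std::priority_queue<Candidate> Queue;
  for (const CallSite &CS : AllSites)
    Queue.push({CS, estimateBenefit(CS)});

  unsigned Growth = 0;
  while (!Queue.empty() && Growth < GrowthBudget) {
    Candidate Best = Queue.top();
    Queue.pop();
    // Inline the best-ranked call site; call sites exposed by the
    // inlining are re-scored and pushed back, so the resulting order
    // is not tied to strict bottom-up or top-down traversal.
    for (const CallSite &NewCS : inlineAndGetNewCalls(Best.CS))
      Queue.push({NewCS, estimateBenefit(NewCS)});
    ++Growth;  // stand-in for a real code-size growth metric
  }
}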
> > > >
> > > > I have a strong problem with global metrics. Things like "only
> > > > allow X% code size growth" mean that whether I inline this callsite
> > > > can depend on seemingly unrelated factors, like how many other
> > > > functions are in the same module, even outside the call stack at
> > > > hand. Similarly for other cutoffs, such as how many inlinings are
> > > > to be performed (now it depends on traversal order, and if you
> > > > provide the inliner with a more complete program then it may choose
> > > > to not inline calls it otherwise would have). I don't like spooky
> > > > action at a distance; it's hard to predict and hard to debug.
> > > >
> > > > Yes, a global cutoff is a poor man's method of modeling 'inlining
> > > > cost > benefit'. However, that does not mean the global inliner
> > > > cannot do better. Using a cutoff is not inherent to the global
> > > > inliner, though it is the most common approximation.
> > >
> > > I agree with Nick; having module changes affect the inlining of
> > > functions that are in no way related, except for the fact that they
> > > happen to be in the same module, is not acceptable. We must think of
> > > a better way. If you have ideas on how we might do this, please
> > > elaborate on them. I suspect there is some disconnected-subgraph
> > > localization that can be applied.
> > >
> > > It is undoubtedly bad when you get different inlining decisions when
> > > you add or remove some unrelated stuff from a module.
> >
> > Good, we're all on the same page then :-) Nevertheless, I consider it
> > to be a requirement that this not happen (please keep in mind that not
> > all LLVM modules come from C/C++ source files, but are generated by
> > all kinds of things). I see no reason why we could not partition the
> > call graph into disconnected components and only apply the limit per
> > component. Perhaps not a spectacular solution, but it seems practical.
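
A sketch of how that per-component limit could be computed (Function and
CallEdge here are hypothetical stand-ins, not LLVM's actual classes):

// Sketch only: partition the call graph into connected components with
// union-find, then hand each component its own growth budget, so
// unrelated functions in the same module cannot affect each other.
#include <map>
#include <vector>

struct Function;  // opaque stand-in for a function in the call graph
struct CallEdge { Function *Caller, *Callee; };

struct UnionFind {
  std::map<Function *, Function *> Parent;
  Function *find(Function *F) {
    auto It = Parent.emplace(F, F).first;
    if (It->second != F)
      It->second = find(It->second);  // path compression
    return It->second;
  }
  void unite(Function *A, Function *B) { Parent[find(A)] = find(B); }
};

// Map each component representative to its own budget. Adding or
// removing a disconnected component elsewhere in the module no longer
// changes the decisions made inside this one.
std::map<Function *, unsigned>
perComponentBudgets(const std::vector<CallEdge> &Edges,
                    unsigned PerComponentBudget) {
  UnionFind UF;
  for (const CallEdge &E : Edges)
    UF.unite(E.Caller, E.Callee);
  std::map<Function *, unsigned> Budget;
  for (const CallEdge &E : Edges)
    Budget[UF.find(E.Caller)] = PerComponentBudget;
  return Budget;
}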
> >
> >
> > > However, in reality, for a well-designed inliner that has other
> > > heuristics or filtering based on code analysis, the module limit is
> > > actually not likely to be hit before the queue is exhausted (for
> > > smaller modules, the growth budget can be larger). The limit is
> > > there to guard against extreme cases.
> >
> > It would be good to know how often this limit is actually hit in
> > practice. Does it ever happen in SPEC, in the LLVM test-suite, or
> > during self-hosting, etc.?
> >
> > SPEC is not interesting, as its source won't change.
>
> I don't understand this sentence.
>


If the problem people are worrying about is random performance changes
due to unrelated source changes, then SPEC is not relevant.



>
> > Please do remember that callsites hitting the global limit are usually
> > 'very low' in the ranking for inlining, so in theory they should not
> > matter much for performance. If one does swing performance badly, you
> > end up with a bigger problem to solve --- fix the inline heuristic to
> > raise the priority for that callsite. Half jokingly, I consider this
> > (the global limit) a feature (for finding opportunities) :).
>
> I understand your point, and I realize you're half joking (and so I'm
> half-awkwardly still being serious), but there are much better ways of
> getting feedback from users than having them complain about random
> performance variations,


Yes, there might be better ways -- but I won't count on users being able
to report those (missing inlines).


> and I won't be laughing after I need to track some of them down. We do
> need to collect good statistics, but that can be done with some appropriate
> infrastructure.
>

I don't have good statistics, but over the last couple of years I only
remember one or two cases where users reported problems like this -- and
the test cases were micro-benchmarks (which are sensitive to basically
anything). For large apps, the count is zero.


thanks,

David



>
> Thanks again,
> Hal
>
> >
> >
> > David
> >
> > Thanks again,
> > Hal
> >
> > > David
> > >
> > > -Hal
> > >
> > > >
> > > > We *do* want more context in the inliner; that's the largest known
> > > > deficiency of our current one. Again, the pass manager rewrite is
> > > > taking place to allow the inliner to call into function analysis
> > > > passes so that we can have more context available when making our
> > > > inlining decision. It's just a long, slow path to getting what we
> > > > want.
> > > >
> > > > Algorithms such as a bottom-up inliner analyze a callsite and
> > > > assign it a value. This could be bottom-up or top-down; it doesn't
> > > > really matter. What matters is that eventually, all (rational)
> > > > callsites end up in the same sorted data structure and are
> > > > addressed in order.
> > > >
> > > > Am I missing something?
> > > >
> > > > The current inliner doesn't assign values across the whole call
> > > > graph and then decide where to inline.
> > > >
> > > > Firstly, the local decision (looking at a single caller-callee pair
> > > > through a particular call site) works by attempting to determine
> > > > how much of the callee will be live given the values known at the
> > > > caller. For instance, we will resolve a switch statement to its
> > > > destination block, and potentially eliminate other callees. These
> > > > simplifications would still be possible even if we calculated
> > > > everything up front.
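
As a source-level illustration of that "how much of the callee is live"
analysis (illustrative code only, not the inliner's actual cost model):

// With a constant argument, only one arm of the callee's switch is
// live, so the effective inlined size is far below the callee's total
// size, and the calls in the dead arms go away.
int expensive0() { return -1; }  // stand-in for a large cold callee
int expensive1() { return -2; }  // stand-in for a large cold callee

int callee(int mode) {
  switch (mode) {
  case 0:  return expensive0();  // dead when mode == 1
  case 1:  return 42;            // the only arm live below
  default: return expensive1();  // dead when mode == 1
  }
}

int caller() {
  // The cost analysis sees mode == 1, resolves the switch to its
  // destination block ('return 42'), and treats the other arms --
  // including their calls -- as eliminated.
  return callee(1);
}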
> > > >
> > > > Secondly, we iterate with the function passes, optimizing the new
> > > > function after each inlining is performed. This may eliminate dead
> > > > code (potentially removing call graph edges) and can resolve loads
> > > > (potentially creating new call graph edges as indirect calls are
> > > > resolved to direct calls). Handling the CFG updates is one of the
> > > > more interesting and difficult parts of the inliner, and it's very
> > > > important for getting C++ virtual calls right. This sort of thing
> > > > can't be calculated up front.
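
And a small illustration of that second point, a new call graph edge
appearing only after inlining (again illustrative source, not the pass):

// Inlining 'callIt' exposes the dynamic type of 'b', letting the
// optimizer turn the indirect (virtual) call into a direct one.
struct Base { virtual int f() { return 0; } };
struct Impl : Base { int f() override { return 1; } };

int callIt(Base *b) { return b->f(); }  // indirect call through the vtable

int known() {
  Impl i;
  // After 'callIt' is inlined here, b is known to point at an Impl, so
  // b->f() resolves to Impl::f -- a *new* direct call graph edge that
  // the iterative inliner can then consider in turn.
  return callIt(&i);
}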
> > > >
> > > > Nick
> > > >
> > > > PS. You may have guessed that I'm just plain prejudiced against
> > > > top-down inliners. I am, and I should call that out before going
> > > > too far down into the discussion.
> > > >
> > > > In the past I've seen them used for their ability to game
> > > > benchmarks (that's my side of the story, not theirs). You provide
> > > > an inliner with tweakable knobs that have really messy, complicated
> > > > interactions all across the inliner depending on all sorts of
> > > > things, then you select the numbers that happen to give you a 20%
> > > > speedup on SPEC for no good reason, and call it a success,
> > > > attributing the success to the flexibility provided by the design.
> > > >
> > > > I have seen compilers add benchmark-specific hacks, but I have
> > > > also seen compilers do an excellent job implementing generally
> > > > useful inlining heuristics (cost/benefit functions) based on
> > > > studying SPEC benchmarks and cross-validating them on large ISV
> > > > programs such as database servers. Think about this: if you can
> > > > tune a parameter to speed up one benchmark by 20% without degrading
> > > > others, then even if the tuning itself may be bogus, it proves that
> > > > the global inliner is quite flexible and tunable. A pure bottom-up
> > > > inliner will have a hard time doing so.
> > > >
> > > >
> > > > Having said this, getting the global inliner to work right may
> > > > take years of refinement and tuning. One caveat is that it cannot
> > > > rely on the on-the-fly cleanups/scalar optimizations to get precise
> > > > summaries.
> > > >
> > > >
> > > > David
> > > >
> > > > On 6 August 2014 08:54, Nick Lewycky <nicholas at mxc.ca> wrote:
> > > >
> > > > Hal Finkel wrote:
> > > >
> > > > I'd like you to elaborate on your assertion here, however, that a
> > > > "top-down inliner is going to work best when you have the whole
> > > > program." It seems to me that, whole program or not, a top-down
> > > > inlining approach is exactly what you want in order to avoid the
> > > > vector-push_back-cold-path-inlining problem (because, from the
> > > > caller, you see many calls to push_back, which is small -- because
> > > > the hot path is small and the cold path has not (yet) been inlined
> > > > -- and you inline them all, at which point you can make a sensible
> > > > decision about the cold-path calls).
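
To make the push_back example concrete (a simplified sketch, not any
real standard library's implementation):

// The shape being discussed: a tiny hot path and a large cold path.
#include <cstddef>

template <typename T> class Vec {
  T *Data = nullptr;
  size_t Size = 0, Cap = 0;

  // Cold path: reallocate and copy. Large, rarely executed.
  void grow() {
    Cap = Cap ? Cap * 2 : 4;
    T *New = new T[Cap];
    for (size_t I = 0; I < Size; ++I)
      New[I] = Data[I];
    delete[] Data;
    Data = New;
  }

public:
  void push_back(const T &V) {
    if (Size == Cap)   // rarely taken
      grow();
    Data[Size++] = V;  // hot path: tiny while grow() stays out of line
  }
};
// Seen from a caller, each push_back is small (the hot path plus one
// call), so inlining them all is cheap; only then is there enough
// context to judge the cold calls to grow() sensibly.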
> > > >
> > > >
> > > > I don't see that. You get the same information when looking at a
> > > > pair of functions and deciding whether to inline. With the
> > > > bottom-up walk, we analyze the caller and callee in their entirety
> > > > before deciding whether to inline. I assume a top-down inliner
> > > > would do the same.
> > > >
> > > > If you have a top-down traversal and you don't have the whole
> > > > program, the first problem you have is a whole ton of starting
> > > > points. At first blush bottom-up seems to have the same problem,
> > > > except that its starting points are generally very straightforward
> > > > leaf functions -- setters and getters, or little loops to test for
> > > > a property. Top-down, you don't yet know what you've got, and it
> > > > has lots of calls that may access arbitrary memory. In either case,
> > > > you apply your metric to inline or not, then you run the
> > > > function-level passes to perform simplification. Bottom-up, you're
> > > > much more likely to get meaningful simplifications -- your
> > > > getter/setter melts away. Top-down, you may remove some redundant
> > > > loads or dead stores, but you still don't know what's going on,
> > > > because you have these opaque, not-yet-analyzed callees in the way.
> > > > If you couldn't analyze the memory before, inlining one level away
> > > > hasn't helped you, and the function size has grown. You don't get
> > > > the simplifications until you go all the way down the call stack to
> > > > the setters and getters etc.
> > > >
> > > > There's a fix for this, and that's to perform a sort of symbolic
> > > > execution and just keep track of what the program has done so far
> > > > (i.e., what values registers have taken on so far, which pointers
> > > > have escaped, etc.), and make each inlining decision in program
> > > > execution order. But that fix doesn't get you very far if you
> > > > haven't got a significant chunk of the program to work with.
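
A rough sketch of what that might look like (every type and helper here
is a hypothetical stand-in, not a real symbolic execution engine):

// Sketch only: walk call sites in approximate program execution order,
// carrying symbolic state, and decide each inlining with that state.
#include <deque>
#include <set>
#include <utility>
#include <vector>

struct Function;
struct CallSite { Function *Caller, *Callee; };

struct SymbolicState {
  std::set<const void *> EscapedPointers;  // pointers known to escape
  // ... plus known values registers have taken on so far, etc.
};

// Hypothetical hooks, stubbed so the sketch is self-contained.
std::vector<CallSite> callSitesInOrder(Function *) { return {}; }
bool worthInlining(const CallSite &, const SymbolicState &) { return false; }
SymbolicState applyEffects(SymbolicState S, const CallSite &) { return S; }
void inlineCall(const CallSite &) {}

void inlineInExecutionOrder(Function *Main) {
  std::deque<std::pair<CallSite, SymbolicState>> Work;
  for (const CallSite &CS : callSitesInOrder(Main))
    Work.push_back({CS, SymbolicState{}});

  while (!Work.empty()) {
    auto [CS, State] = Work.front();
    Work.pop_front();
    if (worthInlining(CS, State)) {
      inlineCall(CS);
      // Descend into the callee's own call sites with updated state,
      // mimicking the order the program would reach them.
      SymbolicState Next = applyEffects(std::move(State), CS);
      for (const CallSite &Inner : callSitesInOrder(CS.Callee))
        Work.push_back({Inner, Next});
    }
  }
}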
> > > >
> > > >
> > > > Nick
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>