[PATCHES] A module inliner pass with a greedy call site queue

Wed Aug 20 01:10:09 PDT 2014

----- Original Message -----
> From: "Xinliang David Li" <xinliangli at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Commits" <llvm-commits at cs.uiuc.edu>, "Jiangning Liu" <Jiangning.Liu at arm.com>, "Nick Lewycky"
> <nicholas at mxc.ca>
> Sent: Tuesday, August 19, 2014 11:40:28 PM
> Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue
> 
> On Tue, Aug 19, 2014 at 3:09 PM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> 
> ----- Original Message -----
> > From: "Xinliang David Li" < xinliangli at gmail.com >
> > To: "Nick Lewycky" < nicholas at mxc.ca >
> > Cc: "LLVM Commits" < llvm-commits at cs.uiuc.edu >, "Jiangning Liu" <
> > Jiangning.Liu at arm.com >
> > Sent: Friday, August 8, 2014 3:18:55 AM
> > Subject: Re: [PATCHES] A module inliner pass with a greedy call
> > site queue
> > 
> > 
> > 
> > 
> > 
> > 
> 
> > Global inliner is the word I use for priority queue based inliner.
> > 
> > 
> > 1) it does not define a particular inlining order
> > 2) it can be modeled to implement strict bottom-up or top-down
> > order
> > 3) the analysis can be performed 'globally' on call chains instead
> > of
> > just caller-callee pair.
> > 4) it is not necessarily 'greedy'.
> > 
> > 
> > 
> > 
> > 
> > 
> > I have a strong problem with global metrics. Things like "only
> > allow
> > X% code size growth" mean that whether I inline this callsite can
> > depend on seemingly unrelated factors like how many other functions
> > are in the same module, even outside the call stack at hand.
> > Similarly for other things like cutoffs about how many inlinings
> > are
> > to be performed (now it depends on traversal order, and if you
> > provide the inliner with a more complete program then it may chose
> > to not inline calls it otherwise would have). I don't like spooky
> > action at a distance, it's hard to predict and hard to debug.
> > 
> > 
> > 
> > yes, global cutoff is a poor man's method to model 'inlining cost >
> > benefit'. However, it does not mean the global inliner can not do
> > better. Using cutoff is not inherent to the global inliner, though
> > the most common approximation.
> 
> I agree with Nick, having module changes affect inlining of functions
> in no way related except for the fact that they happen to be in the
> same module is not acceptable. We must think of a better way. If you
> have ideas on how we might do this, please elaborate on them. I
> suspect there is some disconnected subgraph localization that can be
> applied.
> 
> 
> 
> It is undoubtedly bad when you get different inlining decisions when
> you add or remove some unrelated stuff from a module.

Good, we're all on the same page then :-) Nevertheless, I consider it to be a requirement that this not happen (please keep in mind that not all LLVM modules come from C/C++ source files, but are generated by all kinds of things). I see no reason why we could not partition the call graph into disconnected components and only apply the limit per component. Perhaps not a spectacular solution, but it seems practical.

> However in
> reality for a well designed inliner which has other heuristics or
> filtering based on code analysis, the module limit is actually not
> likely to be hit before the queue is exhausted (for smaller modules,
> the growth budget can be larger). The limit is there to prevent
> extreme cases.

It would be good to know how often this limit is actually hit in practice. Does it ever happen in SPEC, or the LLVM test-suite or during self hosting, etc.?

Thanks again,
Hal

> 
> 
> David
> 
> 
> 
> 
> 
> 
> 
> -Hal
> 
> 
> 
> > 
> > 
> > 
> > We *do* want more context in the inliner, that's the largest known
> > deficiency of our current one. Again, the pass manager rewrite is
> > taking place to allow the inliner to call into function analysis
> > passes so that we can have more context available when making our
> > inlining decision. It's just a long, slow path to getting what we
> > want.
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Algorithms such as a bottom-up inliner
> > 
> > 
> > analyze a callsite and assign it a value. This could be bottom-up
> > or
> > top-down, it doesn't really matter. What matters is that
> > eventually,
> > all
> > (rational) callsites end up in the same sorted datastructure and
> > are
> > addressed in order.
> > 
> > Am I missing something?
> > 
> > The current inliner doesn't assign values across the whole call
> > graph
> > then decide where to inline.
> > 
> > Firstly, the local decision (looking at a single caller-callee pair
> > through a particular call site) works by attempting to determine
> > how
> > much of the callee will be live given the values known at the
> > caller. For instance, we will resolve a switch statement to its
> > destination block, and potentially eliminate other callees. These
> > simplifications would still be possible even if we calculated
> > everything up front.
> > 
> > Secondly, we iterate with the function passes optimizing the new
> > function after each inlining is performed. This may eliminate dead
> > code (potentially removing call graph edges) and can resolve loads
> > (potentially creating new call graph edges as indirect calls are
> > resolved to direct calls). Handling the CFG updates is one of the
> > more interesting and difficult parts of the inliner, and it's very
> > important for getting C++ virtual calls right. This sort of thing
> > can't be calculated up front.
> > 
> > Nick
> > 
> > PS. You may have guessed that I'm just plain prejudiced against
> > top-down inliners. I am, and I should call that out before going
> > too
> > far down into the discussion.
> > 
> > In the past I've seem them used for their ability to game
> > benchmarks
> > (that's my side of the story, not theirs). You provide an inliner
> > with tweakable knobs that have really messy complicated
> > interactions
> > all across the inliner depending on all sorts of things, then you
> > select the numbers that happen to give you a 20% speed up on SPEC
> > for no good reason, and call it success. Attribute the success to
> > the flexibility provided by the design.
> > 
> > 
> > 
> > 
> > I have seen compiler to add benchmark specific hacks, but I have
> > also
> > seen compiler that does excellent job implementing generally useful
> > inlining heuristics (cost/benefit functions) based on study of SPEC
> > benchmarks and cross validate them on large ISV programs such as
> > database severs. Think about this: if you can tune the parameter to
> > speed up one benchmark 20% without degrading others, even though
> > the
> > tuning itself maybe bogus, it proves the fact the global inliner is
> > quite flexible and tunable. Pure bottom-up inliner will find a hard
> > time doing so.
> > 
> > 
> > Having said this, getting the global inliner work right may take
> > years of refinement and tuning to get it right. One thing is that
> > it
> > can not rely on the on-the-fly cleanups/scalar ops to get precise
> > summaries.
> > 
> > 
> > David
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > On 6 August 2014 08:54, Nick Lewycky < nicholas at mxc.ca
> > 
> > <mailto: nicholas at mxc.ca >> wrote:
> > 
> > Hal Finkel wrote:
> > 
> > I'd like you to elaborate on your assertion here, however, that
> > a "topdown inliner is going to work best when you have the whole
> > program." It seems to me that, whole program or not, a top-down
> > inlining approach is exactly what you want to avoid the
> > vector-push_back-cold-path- inlining problem (because, from the
> > 
> > 
> > caller, you see many calls to push_back, which is small --
> > because the hot path is small and the cold path has not (yet)
> > been inlined -- and inlines them all, at which point it can make
> > a sensible decision about the cold-path calls).
> > 
> > 
> > I don't see that. You get the same information when looking at a
> > pair of functions and deciding whether to inline. With the
> > bottom-up
> > walk, we analyze the caller and callee in their entirety before
> > deciding whether to inline. I assume a top-down inliner would do
> > the
> > same.
> > 
> > If you have a top-down traversal and you don't have the whole
> > program, the first problem you have is a whole ton of starting
> > points. At first blush bottom up seems to have the same problem,
> > except that they are generally very straight-forward leaf functions
> > -- setters and getters or little loops to test for a property. Top
> > down you don't yet know what you've got, and it has lots of calls
> > that may access arbitrary memory. In either case, you apply your
> > metric to inline or not. Then you run the function-level passes to
> > perform simplification. Bottom up, you're much more likely to get
> > meaningful simplifications -- your getter/setter melts away. Top
> > down you may remove some redundant loads or dead stores, but you
> > still don't know what's going on because you have these opaque
> > not-yet-analyzed callees in the way. If you couldn't analyze the
> > memory before, inlining one level away hasn't helped you, and the
> > function size has grown. You don't get the simplifications until
> > you
> > go all the way down the call stack to the setters and getters etc.
> > 
> > There's a fix for this, and that's to perform a sort of symbolic
> > execution and just keep track of what the program has done so far
> > (ie. what values registers have taken on so far, which pointers
> > have
> > escaped etc.), and make each inlining decision in program execution
> > order. But that fix doesn't get you very far if you haven't got a
> > significant chunk of program to work with.
> > 
> > 
> > Nick
> > ______________________________ _________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu <mailto: llvm-commits at cs.uiuc. edu >
> > http://lists.cs.uiuc.edu/ mailman/listinfo/llvm-commits
> > < http://lists.cs.uiuc.edu/ mailman/listinfo/llvm-commits >
> > 
> > 
> > 
> > 
> > 
> > ______________________________ _________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/ mailman/listinfo/llvm-commits
> 
> > 
> > 
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > 
> 
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory