[PATCHES] A module inliner pass with a greedy call site queue

Hal Finkel hfinkel at anl.gov
Wed Aug 20 10:31:15 PDT 2014


----- Original Message -----
> From: "Xinliang David Li" <xinliangli at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Commits" <llvm-commits at cs.uiuc.edu>, "Jiangning Liu" <Jiangning.Liu at arm.com>, "Nick Lewycky"
> <nicholas at mxc.ca>
> Sent: Wednesday, August 20, 2014 10:49:15 AM
> Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue
> 
> On Wed, Aug 20, 2014 at 1:10 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> ----- Original Message -----
> > From: "Xinliang David Li" <xinliangli at gmail.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "LLVM Commits" <llvm-commits at cs.uiuc.edu>, "Jiangning Liu" <Jiangning.Liu at arm.com>, "Nick Lewycky" <nicholas at mxc.ca>
> > Sent: Tuesday, August 19, 2014 11:40:28 PM
> > Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue
> > 
> > On Tue, Aug 19, 2014 at 3:09 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > 
> > ----- Original Message -----
> > > From: "Xinliang David Li" <xinliangli at gmail.com>
> > > To: "Nick Lewycky" <nicholas at mxc.ca>
> > > Cc: "LLVM Commits" <llvm-commits at cs.uiuc.edu>, "Jiangning Liu" <Jiangning.Liu at arm.com>
> > > Sent: Friday, August 8, 2014 3:18:55 AM
> > > Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue
> > > 
> > > "Global inliner" is the term I use for a priority-queue-based
> > > inliner:
> > > 
> > > 1) it does not define a particular inlining order;
> > > 2) it can be modeled to implement a strict bottom-up or top-down
> > > order;
> > > 3) the analysis can be performed 'globally' on call chains instead
> > > of just caller-callee pairs;
> > > 4) it is not necessarily 'greedy'.
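> > > 
> > > For concreteness, the core loop of such a queue-based inliner
> > > might look roughly like the sketch below (purely illustrative:
> > > the Candidate type and the estimateBenefit, shouldInline,
> > > inlineAndMeasure, and exposedCallSites hooks are hypothetical
> > > stand-ins, not any existing API):
> > > 
> > >   #include <queue>
> > >   #include <vector>
> > > 
> > >   struct CallSite { int ID; };  // toy stand-in for a real call site
> > > 
> > >   struct Candidate {
> > >     CallSite CS;
> > >     double Benefit;  // estimated benefit/cost ratio
> > >     bool operator<(const Candidate &O) const {
> > >       return Benefit < O.Benefit;  // max-heap: best candidate first
> > >     }
> > >   };
> > > 
> > >   // Hypothetical hooks a real implementation would provide.
> > >   double estimateBenefit(CallSite CS);
> > >   bool shouldInline(const Candidate &C);   // local filters/heuristics
> > >   unsigned inlineAndMeasure(CallSite CS);  // inlines, returns size growth
> > >   std::vector<CallSite> exposedCallSites(CallSite CS);
> > > 
> > >   void runQueueInliner(const std::vector<CallSite> &Initial,
> > >                        unsigned GrowthBudget) {
> > >     std::priority_queue<Candidate> Q;
> > >     for (CallSite CS : Initial)
> > >       Q.push({CS, estimateBenefit(CS)});
> > > 
> > >     unsigned Growth = 0;
> > >     while (!Q.empty() && Growth <= GrowthBudget) {
> > >       Candidate C = Q.top();
> > >       Q.pop();
> > >       if (!shouldInline(C))
> > >         continue;
> > >       Growth += inlineAndMeasure(C.CS);
> > >       // Inlining may expose the callee's call sites in the caller;
> > >       // re-queue them so whole call chains get reconsidered.
> > >       for (CallSite New : exposedCallSites(C.CS))
> > >         Q.push({New, estimateBenefit(New)});
> > >     }
> > >   }
> > > 
> > > The visit order falls out entirely from estimateBenefit, which is
> > > what makes points 1) and 2) true, and the GrowthBudget check is
> > > exactly the kind of (optional) global cutoff discussed below.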
> > > 
> > > I have a strong problem with global metrics. Things like "only
> > > allow X% code size growth" mean that whether I inline this call
> > > site can depend on seemingly unrelated factors, like how many
> > > other functions are in the same module, even outside the call
> > > stack at hand. Similarly for other things like cutoffs on how
> > > many inlinings are to be performed (now it depends on traversal
> > > order, and if you provide the inliner with a more complete
> > > program then it may choose to not inline calls it otherwise would
> > > have). I don't like spooky action at a distance; it's hard to
> > > predict and hard to debug.
> > > 
> > > Yes, a global cutoff is a poor man's method of modeling 'inlining
> > > cost > benefit'. However, that does not mean the global inliner
> > > cannot do better. Using a cutoff is not inherent to the global
> > > inliner, though it is the most common approximation.
> > 
> > I agree with Nick: having module changes affect the inlining of
> > functions that are in no way related, except for the fact that they
> > happen to be in the same module, is not acceptable. We must think
> > of a better way. If you have ideas on how we might do this, please
> > elaborate on them. I suspect there is some disconnected-subgraph
> > localization that can be applied.
> > 
> > It is undoubtedly bad when you get different inlining decisions
> > when you add or remove some unrelated stuff from a module.
> 
> Good, we're all on the same page then :-) Nevertheless, I consider it
> to be a requirement that this not happen (please keep in mind that
> not all LLVM modules come from C/C++ source files, but are generated
> by all kinds of things). I see no reason why we could not partition
> the call graph into disconnected components and only apply the limit
> per component. Perhaps not a spectacular solution, but it seems
> practical.
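> 
> Concretely, something like the following per-component bookkeeping
> would do (a minimal sketch using plain integer indices rather than
> real call graph nodes; componentOf and the edge-list representation
> are illustrative only):
> 
>   #include <numeric>
>   #include <utility>
>   #include <vector>
> 
>   // Minimal union-find over function indices.
>   struct UnionFind {
>     std::vector<int> Parent;
>     explicit UnionFind(int N) : Parent(N) {
>       std::iota(Parent.begin(), Parent.end(), 0);
>     }
>     int find(int X) {
>       return Parent[X] == X ? X : Parent[X] = find(Parent[X]);
>     }
>     void merge(int A, int B) { Parent[find(A)] = find(B); }
>   };
> 
>   // Group functions into connected components of the call graph, so
>   // that a growth budget applied per component cannot be perturbed
>   // by unrelated functions that merely share the module.
>   std::vector<int>
>   componentOf(int NumFuncs,
>               const std::vector<std::pair<int, int>> &CallEdges) {
>     UnionFind UF(NumFuncs);
>     for (const auto &E : CallEdges)
>       UF.merge(E.first, E.second);
>     std::vector<int> Comp(NumFuncs);
>     for (int F = 0; F < NumFuncs; ++F)
>       Comp[F] = UF.find(F);
>     return Comp;
>   }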
> 
> > However, in reality, for a well-designed inliner that has other
> > heuristics or filtering based on code analysis, the module limit is
> > actually unlikely to be hit before the queue is exhausted (for
> > smaller modules, the growth budget can be larger). The limit is
> > there to prevent extreme cases.
> 
> It would be good to know how often this limit is actually hit in
> practice. Does it ever happen in SPEC, in the LLVM test-suite, or
> during self-hosting, etc.?
> 
> SPEC is not interesting, as its source won't change.

I don't understand this sentence.

> Please do remember that call sites hitting the global limit are
> usually 'very low' in ranking for inlining, so in theory they should
> not matter much for performance. If it does swing performance badly,
> you end up with a bigger problem to solve --- fix the inline
> heuristic to raise the priority of that call site. Half jokingly, I
> consider this (global limit) a feature (to find opportunities) :).

I understand your point, and I realize you're half joking (so I'm half-awkwardly still being serious), but there are much better ways of getting feedback from users than having them complain about random performance variations, and I won't be laughing when I need to track some of those variations down. We do need to collect good statistics, but that can be done with appropriate infrastructure.
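
For example, LLVM's existing statistics facility seems sufficient for a first cut. A counter along these lines (NumBudgetLimited is a made-up name, and the enclosing inliner pass is assumed) would show up in -stats output and tell us how often the limit actually fires:

  #include "llvm/ADT/Statistic.h"

  #define DEBUG_TYPE "inline"

  // Hypothetical counter: call sites rejected solely because the
  // global size-growth budget was exhausted.
  STATISTIC(NumBudgetLimited,
            "Number of call sites rejected by the global growth budget");

  // At the (hypothetical) rejection point in the inliner:
  //   if (Growth > GrowthBudget) { ++NumBudgetLimited; continue; }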

Thanks again,
Hal

> 
> David
> 
> > > 
> > > We *do* want more context in the inliner; that's the largest
> > > known deficiency of our current one. Again, the pass manager
> > > rewrite is taking place to allow the inliner to call into
> > > function analysis passes so that we can have more context
> > > available when making our inlining decisions. It's just a long,
> > > slow path to getting what we want.
> > > 
> > > Algorithms such as a bottom-up inliner analyze a call site and
> > > assign it a value. This could be bottom-up or top-down, it
> > > doesn't really matter. What matters is that eventually, all
> > > (rational) call sites end up in the same sorted data structure
> > > and are addressed in order.
> > > 
> > > Am I missing something?
> > > 
> > > The current inliner doesn't assign values across the whole call
> > > graph and then decide where to inline.
> > > 
> > > Firstly, the local decision (looking at a single caller-callee
> > > pair through a particular call site) works by attempting to
> > > determine how much of the callee will be live given the values
> > > known at the caller. For instance, we will resolve a switch
> > > statement to its destination block, and potentially eliminate
> > > other callees. These simplifications would still be possible even
> > > if we calculated everything up front.
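> > > 
> > > To give a flavor of that liveness idea with a toy model (this is
> > > only an illustration, not the real cost analysis): when the value
> > > feeding a switch is a known constant at the call site, the switch
> > > collapses to a single successor, and the other arms stop counting
> > > toward the inlining cost.
> > > 
> > >   #include <map>
> > >   #include <optional>
> > > 
> > >   struct SwitchInfo {
> > >     std::map<int, int> CaseToBlock;  // case value -> successor block id
> > >     int DefaultBlock;
> > >   };
> > > 
> > >   // Returns the single live successor if the switched-on value is
> > >   // known, or nullopt if every arm must be treated as live.
> > >   std::optional<int> resolveSwitch(const SwitchInfo &SI,
> > >                                    std::optional<int> KnownArg) {
> > >     if (!KnownArg)
> > >       return std::nullopt;
> > >     auto It = SI.CaseToBlock.find(*KnownArg);
> > >     return It != SI.CaseToBlock.end() ? It->second : SI.DefaultBlock;
> > >   }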
> > > 
> > > Secondly, we iterate with the function passes, optimizing the
> > > new function after each inlining is performed. This may eliminate
> > > dead code (potentially removing call graph edges) and can resolve
> > > loads (potentially creating new call graph edges as indirect
> > > calls are resolved to direct calls). Handling the call graph
> > > updates is one of the more interesting and difficult parts of the
> > > inliner, and it's very important for getting C++ virtual calls
> > > right. This sort of thing can't be calculated up front.
> > > 
> > > Nick
> > > 
> > > PS. You may have guessed that I'm just plain prejudiced against
> > > top-down inliners. I am, and I should call that out before going
> > > too far down into the discussion.
> > > 
> > > In the past I've seen them used for their ability to game
> > > benchmarks (that's my side of the story, not theirs). You provide
> > > an inliner with tweakable knobs that have really messy,
> > > complicated interactions all across the inliner depending on all
> > > sorts of things, then you select the numbers that happen to give
> > > you a 20% speedup on SPEC for no good reason, call it a success,
> > > and attribute the success to the flexibility provided by the
> > > design.
> > > 
> > > I have seen compilers add benchmark-specific hacks, but I have
> > > also seen compilers that do an excellent job implementing
> > > generally useful inlining heuristics (cost/benefit functions)
> > > based on study of SPEC benchmarks, cross-validating them on large
> > > ISV programs such as database servers. Think about this: if you
> > > can tune the parameters to speed up one benchmark by 20% without
> > > degrading others, then even if the tuning itself is bogus, it
> > > proves that the global inliner is quite flexible and tunable. A
> > > pure bottom-up inliner will have a hard time doing so.
> > > 
> > > Having said this, getting the global inliner to work right may
> > > take years of refinement and tuning. One catch is that it cannot
> > > rely on the on-the-fly cleanups/scalar optimizations to get
> > > precise summaries.
> > > 
> > > David
> > > 
> > > On 6 August 2014 08:54, Nick Lewycky <nicholas at mxc.ca> wrote:
> > > 
> > > Hal Finkel wrote:
> > > 
> > > I'd like you to elaborate on your assertion here, however, that a
> > > "top-down inliner is going to work best when you have the whole
> > > program." It seems to me that, whole program or not, a top-down
> > > inlining approach is exactly what you want to avoid the
> > > vector-push_back-cold-path-inlining problem (because, from the
> > > caller, you see many calls to push_back, which is small --
> > > because the hot path is small and the cold path has not (yet)
> > > been inlined -- and inline them all, at which point you can make
> > > a sensible decision about the cold-path calls).
> > > 
> > > I don't see that. You get the same information when looking at a
> > > pair of functions and deciding whether to inline. With the
> > > bottom-up walk, we analyze the caller and callee in their
> > > entirety before deciding whether to inline. I assume a top-down
> > > inliner would do the same.
> > > 
> > > If you have a top-down traversal and you don't have the whole
> > > program, the first problem you have is a whole ton of starting
> > > points. At first blush, bottom-up seems to have the same problem,
> > > except that its starting points are generally very
> > > straightforward leaf functions -- setters and getters, or little
> > > loops to test for a property. Top-down, you don't yet know what
> > > you've got, and it has lots of calls that may access arbitrary
> > > memory. In either case, you apply your metric to inline or not.
> > > Then you run the function-level passes to perform simplification.
> > > Bottom-up, you're much more likely to get meaningful
> > > simplifications -- your getter/setter melts away. Top-down, you
> > > may remove some redundant loads or dead stores, but you still
> > > don't know what's going on because you have these opaque
> > > not-yet-analyzed callees in the way. If you couldn't analyze the
> > > memory before, inlining one level away hasn't helped you, and the
> > > function size has grown. You don't get the simplifications until
> > > you go all the way down the call stack to the setters and getters
> > > etc.
> > > 
> > > There's a fix for this, and that's to perform a sort of symbolic
> > > execution and just keep track of what the program has done so far
> > > (i.e., what values registers have taken on so far, which pointers
> > > have escaped, etc.), and make each inlining decision in program
> > > execution order. But that fix doesn't get you very far if you
> > > haven't got a significant chunk of the program to work with.
> > > 
> > > Nick

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


