[PATCHES] A module inliner pass with a greedy call site queue

Yin Ma yinma at codeaurora.org
Wed Aug 27 11:52:05 PDT 2014


Hi,

 

I just got back. A lot of new posts.

 

Regarding global limitations, for example a cutoff on code size growth:
I found this kind of global limit didn't work well in some situations
with the greedy inliner, because the greedy inliner prioritizes call
sites based on static analysis. Without real runtime information, call
sites with lower priority may still be in hot functions. Profile-guided
inlining should fix this problem.

 

A size-based global cutoff has a different issue: when inlining one
call site hits the cutoff, the next inlining may decrease the global
size back below the cutoff, so inlining may finish much earlier than
expected. So we may have this kind of global limiter, but in the
normal situation, all global limiters should still be applied.
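
To make the skip-and-continue point concrete, here is a minimal sketch
of a greedy call site queue under a global size budget (hypothetical
types and names, not the code from the patches):

    #include <queue>
    #include <vector>

    struct CallSite {
      int priority;  // static ranking from the analysis
      int sizeDelta; // estimated module size growth (may be negative)
    };

    struct ByPriority {
      bool operator()(const CallSite &a, const CallSite &b) const {
        return a.priority < b.priority; // max-heap: highest priority first
      }
    };

    void inlineGreedy(std::priority_queue<CallSite, std::vector<CallSite>,
                                          ByPriority> &queue,
                      int &moduleSize, int sizeCutoff) {
      while (!queue.empty()) {
        CallSite cs = queue.top();
        queue.pop();
        // Skip a site that would bust the budget instead of stopping:
        // a later inlining with negative sizeDelta may bring the
        // module back under the cutoff.
        if (moduleSize + cs.sizeDelta > sizeCutoff)
          continue;
        moduleSize += cs.sizeDelta;
        // ... perform the inlining and push newly exposed call sites ...
      }
    }

The point is the "continue" rather than a "break": hitting the cutoff
on one call site is not treated as a stop signal for the whole queue.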

 

Regarding an iterative framework for the inliner and other
optimizations: for the inliner itself, I believe two passes should be
good enough, one at the very beginning of the pass queue and one at
the end. In general, the size of the IR before optimization is very
linear to the size after optimization. Some optimizations increase the
IR size, such as unrolling. For callees whose IR grew, we actually do
not inline them again in the iterative framework most of the time. So
if we only add another pass to catch all callees whose IR shrank, that
should be enough in my opinion.
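
A minimal sketch of what that late pass could select, assuming we
track callee IR size before and after the function-pass pipeline
(hypothetical names, for illustration only):

    #include <vector>

    struct CalleeInfo {
      int sizeBeforeOpt; // IR size when the early inliner ran
      int sizeAfterOpt;  // IR size after the function-pass pipeline
    };

    // The late inliner pass reconsiders only callees whose IR shrank
    // during optimization; callees that grew (e.g. after unrolling)
    // are not inlined again.
    std::vector<CalleeInfo>
    pickLateCandidates(const std::vector<CalleeInfo> &callees) {
      std::vector<CalleeInfo> out;
      for (const CalleeInfo &c : callees)
        if (c.sizeAfterOpt < c.sizeBeforeOpt)
          out.push_back(c);
      return out;
    }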

 

Yin 

 

From: llvm-commits-bounces at cs.uiuc.edu On Behalf Of Xinliang David Li
Sent: Wednesday, August 20, 2014 10:45 AM
To: Hal Finkel
Cc: Jiangning Liu; LLVM Commits
Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue

 

 

 

On Wed, Aug 20, 2014 at 10:31 AM, Hal Finkel <hfinkel at anl.gov> wrote:

----- Original Message -----
> From: "Xinliang David Li" <xinliangli at gmail.com <mailto:xinliangli at gmail.com> >
> To: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at anl.gov> >
> Cc: "LLVM Commits" <llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu> >, "Jiangning Liu" <Jiangning.Liu at arm.com <mailto:Jiangning.Liu at arm.com> >, "Nick Lewycky"
> <nicholas at mxc.ca <mailto:nicholas at mxc.ca> >

> Sent: Wednesday, August 20, 2014 10:49:15 AM
> Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue
>
> On Wed, Aug 20, 2014 at 1:10 AM, Hal Finkel <hfinkel at anl.gov>
> wrote:
>
>
>
> ----- Original Message -----
> > From: "Xinliang David Li" < xinliangli at gmail.com <mailto:xinliangli at gmail.com>  >
>
>
> > To: "Hal Finkel" < hfinkel at anl.gov <mailto:hfinkel at anl.gov>  >
> > Cc: "LLVM Commits" < llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>  >, "Jiangning Liu" <
> > Jiangning.Liu at arm.com <mailto:Jiangning.Liu at arm.com>  >, "Nick Lewycky"
> > < nicholas at mxc.ca <mailto:nicholas at mxc.ca>  >
> > Sent: Tuesday, August 19, 2014 11:40:28 PM
> > Subject: Re: [PATCHES] A module inliner pass with a greedy call
> > site queue
> >
> > On Tue, Aug 19, 2014 at 3:09 PM, Hal Finkel <hfinkel at anl.gov>
> > wrote:
> >
> >
> >
> > ----- Original Message -----
> > > From: "Xinliang David Li" < xinliangli at gmail.com <mailto:xinliangli at gmail.com>  >
> > > To: "Nick Lewycky" < nicholas at mxc.ca <mailto:nicholas at mxc.ca>  >
> > > Cc: "LLVM Commits" < llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>  >, "Jiangning Liu"
> > > <
> > > Jiangning.Liu at arm.com <mailto:Jiangning.Liu at arm.com>  >
> > > Sent: Friday, August 8, 2014 3:18:55 AM
> > > Subject: Re: [PATCHES] A module inliner pass with a greedy call
> > > site queue
> > >
> > >
> > >
> > >
> > >
> > >
> >
> > > "Global inliner" is the term I use for a priority-queue-based
> > > inliner:
> > >
> > > 1) it does not define a particular inlining order
> > > 2) it can be modeled to implement strict bottom-up or top-down
> > >    order (see the sketch below)
> > > 3) the analysis can be performed 'globally' on call chains
> > >    instead of just caller-callee pairs
> > > 4) it is not necessarily 'greedy'.
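
As an illustration of point 2, the queue order is determined entirely
by the priority function; a sketch (hypothetical code, not from the
patches):

    // Priority for a call site, keyed by the callee's position in a
    // post-order walk of the call graph (leaves get the smallest
    // indices). With a max-heap queue, negating the index pops leaf
    // callees first, i.e. strict bottom-up; dropping the negation
    // yields top-down. A richer cost/benefit score slots in the same
    // way without changing the queue machinery.
    int bottomUpPriority(int calleePostOrderIndex) {
      return -calleePostOrderIndex;
    }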
> > >
> > >
> > >
> > >
> > >
> > >
> > > I have a strong problem with global metrics. Things like "only
> > > allow X% code size growth" mean that whether I inline this
> > > callsite can depend on seemingly unrelated factors like how many
> > > other functions are in the same module, even outside the call
> > > stack at hand. Similarly for other things like cutoffs on how
> > > many inlinings are to be performed (now it depends on traversal
> > > order, and if you provide the inliner with a more complete
> > > program then it may choose to not inline calls it otherwise
> > > would have). I don't like spooky action at a distance; it's hard
> > > to predict and hard to debug.
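
The objection can be made concrete with a toy model (illustrative
only): under a fixed percentage cap, the very same call site flips
between inlined and rejected depending on how much unrelated code
shares the module.

    // Toy model of an "allow X% module growth" rule. Whether a given
    // call site may be inlined depends on the size of unrelated code
    // sharing the module -- the spooky action at a distance described
    // above.
    bool mayInline(int callSiteGrowth, int alreadyGrown,
                   int moduleSize, int growthPercent) {
      int budget = moduleSize * growthPercent / 100;
      return alreadyGrown + callSiteGrowth <= budget;
    }
    // mayInline(50, 0, 1000, 10) == true   (budget = 100)
    // mayInline(50, 0,  400, 10) == false  (same call site, smaller module)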
> > >
> > >
> > >
> > > yes, a global cutoff is a poor man's method to model 'inlining
> > > cost > benefit'. However, it does not mean the global inliner
> > > cannot do better. Using a cutoff is not inherent to the global
> > > inliner, though it is the most common approximation.
> >
> > I agree with Nick: having module changes affect the inlining of
> > functions that are in no way related, except for the fact that
> > they happen to be in the same module, is not acceptable. We must
> > think of a better way. If you have ideas on how we might do this,
> > please elaborate on them. I suspect there is some
> > disconnected-subgraph localization that can be applied.
> >
> >
> >
> > It is undoubtedly bad when you get different inlining decisions
> > when
> > you add or remove some unrelated stuff from a module.
>
> Good, we're all on the same page then :-) Nevertheless, I consider it
> to be a requirement that this not happen (please keep in mind that
> not all LLVM modules come from C/C++ source files, but are generated
> by all kinds of things). I see no reason why we could not partition
> the call graph into disconnected components and only apply the limit
> per component. Perhaps not a spectacular solution, but it seems
> practical.
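
For illustration, a sketch of the per-component localization described
above (hypothetical code with a toy union-find, not from the patches):

    #include <numeric>
    #include <utility>
    #include <vector>

    // Minimal union-find over call graph nodes (functions).
    struct UnionFind {
      std::vector<int> parent;
      explicit UnionFind(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
      }
      int find(int x) {
        return parent[x] == x ? x : parent[x] = find(parent[x]);
      }
      void unite(int a, int b) { parent[find(a)] = find(b); }
    };

    // Give every connected component of the call graph its own growth
    // budget, sized from that component alone, so adding or removing
    // unrelated functions elsewhere in the module cannot change the
    // decisions made within a component.
    std::vector<int>
    perComponentBudget(int numFuncs,
                       const std::vector<std::pair<int, int>> &callEdges,
                       const std::vector<int> &funcSize,
                       int growthPercent) {
      UnionFind uf(numFuncs);
      for (const auto &e : callEdges)
        uf.unite(e.first, e.second);
      std::vector<int> componentSize(numFuncs, 0), budget(numFuncs, 0);
      for (int f = 0; f < numFuncs; ++f)
        componentSize[uf.find(f)] += funcSize[f];
      for (int f = 0; f < numFuncs; ++f)
        budget[f] = componentSize[uf.find(f)] * growthPercent / 100;
      return budget;
    }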
>
>
> > However in
> > reality for a well designed inliner which has other heuristics or
> > filtering based on code analysis, the module limit is actually not
> > likely to be hit before the queue is exhausted (for smaller
> > modules,
> > the growth budget can be larger). The limit is there to prevent
> > extreme cases.
>
> It would be good to know how often this limit is actually hit in
> practice. Does it ever happen in SPEC, or the LLVM test-suite or
> during self hosting, etc.?
>
>
>
>
>
> SPEC is not interesting, as its source won't change.

I don't understand this sentence.

 

 

If the problem people are worrying about is random performance changes due to unrelated source changes, then SPEC is not relevant.

 

 


> Please do
> remember that call sites hitting the global limit are usually
> 'very low' in ranking for inlining, so in theory they should not
> matter much for performance. If it does swing performance badly, you
> end up with a bigger problem to solve --- fix the inline heuristic
> to hoist the priority for that callsite. Half jokingly, I consider
> this (global limit) a feature (to find opportunities) :).

I understand your point, and I realize you're half joking (and so I'm half-awkwardly still being serious), but there are much better ways of getting feedback from users than having them complain about random performance variations,

 

Yes, there might be better ways -- but I won't count on users being able to report those (missing inlines).

 

and I won't be laughing after I need to track some of them down. We do need to collect good statistics, but that can be done with some appropriate infrastructure.

 

I don't have good statistics, but over the last couple of years, I only remember 1 or 2 cases where users reported problems like this -- and the test cases were also microbenchmarks (which are sensitive to basically anything). For large apps, the record is 0.

 

 

thanks,

 

David

 

 


Thanks again,
Hal


>
>
> David
>
>
>
>
>
> Thanks again,
> Hal
>
>
>
> >
> >
> > David
> >
> >
> >
> >
> >
> >
> >
> > -Hal
> >
> >
> >
> > >
> > >
> > >
> > > We *do* want more context in the inliner; that's the largest
> > > known deficiency of our current one. Again, the pass manager
> > > rewrite is taking place to allow the inliner to call into
> > > function analysis passes so that we can have more context
> > > available when making our inlining decisions. It's just a long,
> > > slow path to getting what we want.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Algorithms such as a bottom-up inliner analyze a callsite and
> > > assign it a value. This could be bottom-up or top-down; it
> > > doesn't really matter. What matters is that eventually all
> > > (rational) callsites end up in the same sorted data structure
> > > and are addressed in order.
> > >
> > > Am I missing something?
> > >
> > > The current inliner doesn't assign values across the whole call
> > > graph
> > > then decide where to inline.
> > >
> > > Firstly, the local decision (looking at a single caller-callee
> > > pair
> > > through a particular call site) works by attempting to determine
> > > how
> > > much of the callee will be live given the values known at the
> > > caller. For instance, we will resolve a switch statement to its
> > > destination block, and potentially eliminate other callees. These
> > > simplifications would still be possible even if we calculated
> > > everything up front.
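
For instance, the switch resolution case looks like this at the source
level (an illustrative example, not from the thread):

    int small();
    int huge();
    int fallback();

    int dispatch(int kind) {
      switch (kind) {              // with a constant argument known at
      case 0:  return small();     // the call site, only one destination
      case 1:  return huge();      // block stays live, and the callee
      default: return fallback();  // is priced accordingly
      }
    }

    int user() { return dispatch(0); } // only the small() path is live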
> > >
> > > Secondly, we iterate with the function passes optimizing the new
> > > function after each inlining is performed. This may eliminate
> > > dead
> > > code (potentially removing call graph edges) and can resolve
> > > loads
> > > (potentially creating new call graph edges as indirect calls are
> > > resolved to direct calls). Handling the call graph updates is
> > > one of the more interesting and difficult parts of the inliner,
> > > and it's very important for getting C++ virtual calls right.
> > > This sort of thing can't be calculated up front.
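
The virtual call case can be seen in a few lines of C++ (an
illustrative example):

    struct Base {
      virtual int f() { return 1; }
      virtual ~Base() {}
    };
    struct Derived : Base {
      int f() override { return 2; }
    };

    int callThrough(Base *b) { return b->f(); } // indirect (virtual) call

    int driver() {
      Derived d;
      // Once callThrough is inlined here, the function passes can see
      // that *b is exactly a Derived, fold the vtable load, and turn
      // the indirect call into a direct call to Derived::f -- a new
      // call graph edge that could not have been computed up front.
      return callThrough(&d);
    }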
> > >
> > > Nick
> > >
> > > PS. You may have guessed that I'm just plain prejudiced against
> > > top-down inliners. I am, and I should call that out before going
> > > too
> > > far down into the discussion.
> > >
> > > In the past I've seen them used for their ability to game
> > > benchmarks
> > > (that's my side of the story, not theirs). You provide an inliner
> > > with tweakable knobs that have really messy complicated
> > > interactions
> > > all across the inliner depending on all sorts of things, then you
> > > select the numbers that happen to give you a 20% speed up on SPEC
> > > for no good reason, and call it success. Attribute the success to
> > > the flexibility provided by the design.
> > >
> > >
> > >
> > >
> > > I have seen compilers add benchmark-specific hacks, but I have
> > > also seen compilers that do an excellent job implementing
> > > generally useful inlining heuristics (cost/benefit functions)
> > > based on study of SPEC benchmarks, cross-validated on large ISV
> > > programs such as database servers. Think about this: if you can
> > > tune the parameters to speed up one benchmark 20% without
> > > degrading others, even though the tuning itself may be bogus, it
> > > proves that the global inliner is quite flexible and tunable. A
> > > pure bottom-up inliner will have a hard time doing so.
> > >
> > >
> > > Having said this, getting the global inliner right may take
> > > years of refinement and tuning. One caveat is that it cannot
> > > rely on the on-the-fly cleanups/scalar optimizations to get
> > > precise summaries.
> > >
> > >
> > > David
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 6 August 2014 08:54, Nick Lewycky <nicholas at mxc.ca> wrote:
> > >
> > > Hal Finkel wrote:
> > >
> > > I'd like you to elaborate on your assertion here, however, that
> > > a "topdown inliner is going to work best when you have the whole
> > > program." It seems to me that, whole program or not, a top-down
> > > inlining approach is exactly what you want to avoid the
> > > vector-push_back-cold-path- inlining problem (because, from the
> > >
> > >
> > > caller, you see many calls to push_back, which is small --
> > > because the hot path is small and the cold path has not (yet)
> > > been inlined -- and inlines them all, at which point it can make
> > > a sensible decision about the cold-path calls).
> > >
> > >
> > > I don't see that. You get the same information when looking at a
> > > pair of functions and deciding whether to inline. With the
> > > bottom-up
> > > walk, we analyze the caller and callee in their entirety before
> > > deciding whether to inline. I assume a top-down inliner would do
> > > the
> > > same.
> > >
> > > If you have a top-down traversal and you don't have the whole
> > > program, the first problem you have is a whole ton of starting
> > > points. At first blush bottom up seems to have the same problem,
> > > except that they are generally very straightforward leaf
> > > functions -- setters and getters or little loops to test for a
> > > property.
> > > Top
> > > down you don't yet know what you've got, and it has lots of calls
> > > that may access arbitrary memory. In either case, you apply your
> > > metric to inline or not. Then you run the function-level passes
> > > to
> > > perform simplification. Bottom up, you're much more likely to get
> > > meaningful simplifications -- your getter/setter melts away. Top
> > > down you may remove some redundant loads or dead stores, but you
> > > still don't know what's going on because you have these opaque
> > > not-yet-analyzed callees in the way. If you couldn't analyze the
> > > memory before, inlining one level away hasn't helped you, and the
> > > function size has grown. You don't get the simplifications until
> > > you
> > > go all the way down the call stack to the setters and getters
> > > etc.
> > >
> > > There's a fix for this, and that's to perform a sort of symbolic
> > > execution and just keep track of what the program has done so far
> > > (i.e., what values registers have taken on so far, which pointers
> > > have
> > > escaped etc.), and make each inlining decision in program
> > > execution
> > > order. But that fix doesn't get you very far if you haven't got a
> > > significant chunk of program to work with.
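
A toy sketch of that execution-order state, just to pin the idea down
(hypothetical names, not an actual implementation):

    #include <set>
    #include <string>

    // State for the execution-order idea: visit call sites in program
    // order, carrying forward what is already known, and let that
    // accumulated context feed each inlining decision.
    struct ExecState {
      std::set<std::string> knownConstants; // values resolved so far
      std::set<std::string> escapedPtrs;    // pointers that escaped
    };

    bool shouldInline(const ExecState &s, const std::string &arg) {
      // More known context means more of the callee folds away after
      // inlining, so the call site prices as cheaper.
      return s.knownConstants.count(arg) > 0 &&
             s.escapedPtrs.count(arg) == 0;
    }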
> > >
> > >
> > > Nick
> >
> >
> >
>

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

 
