<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Aug 20, 2014 at 10:31 AM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">----- Original Message -----<br>

> From: "Xinliang David Li" <<a href="mailto:xinliangli@gmail.com">xinliangli@gmail.com</a>><br>

> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>

> Cc: "LLVM Commits" <<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a>>, "Jiangning Liu" <<a href="mailto:Jiangning.Liu@arm.com">Jiangning.Liu@arm.com</a>>, "Nick Lewycky"<br>


> <<a href="mailto:nicholas@mxc.ca">nicholas@mxc.ca</a>><br>

</div><div><div class="h5">> Sent: Wednesday, August 20, 2014 10:49:15 AM<br>

> Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue<br>

><br>

> On Wed, Aug 20, 2014 at 1:10 AM, Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> ><br>

> wrote:<br>

><br>

><br>

><br>

> ----- Original Message -----<br>

> > From: "Xinliang David Li" < <a href="mailto:xinliangli@gmail.com">xinliangli@gmail.com</a> ><br>

><br>

><br>

> > To: "Hal Finkel" < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> ><br>

> > Cc: "LLVM Commits" < <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a> >, "Jiangning Liu" <<br>

> > <a href="mailto:Jiangning.Liu@arm.com">Jiangning.Liu@arm.com</a> >, "Nick Lewycky"<br>

> > < <a href="mailto:nicholas@mxc.ca">nicholas@mxc.ca</a> ><br>

> > Sent: Tuesday, August 19, 2014 11:40:28 PM<br>

> > Subject: Re: [PATCHES] A module inliner pass with a greedy call<br>

> > site queue<br>

> ><br>

> > On Tue, Aug 19, 2014 at 3:09 PM, Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> ><br>

> > wrote:<br>

> ><br>

> ><br>

> ><br>

> > ----- Original Message -----<br>

> > > From: "Xinliang David Li" < <a href="mailto:xinliangli@gmail.com">xinliangli@gmail.com</a> ><br>

> > > To: "Nick Lewycky" < <a href="mailto:nicholas@mxc.ca">nicholas@mxc.ca</a> ><br>

> > > Cc: "LLVM Commits" < <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a> >, "Jiangning Liu"<br>

> > > <<br>

> > > <a href="mailto:Jiangning.Liu@arm.com">Jiangning.Liu@arm.com</a> ><br>

> > > Sent: Friday, August 8, 2014 3:18:55 AM<br>

> > > Subject: Re: [PATCHES] A module inliner pass with a greedy call<br>

> > > site queue<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> ><br>

> > > Global inliner is the word I use for priority queue based<br>

> > > inliner.<br>

> > ><br>

> > ><br>

> > > 1) it does not define a particular inlining order<br>

> > > 2) it can be modeled to implement strict bottom-up or top-down<br>

> > > order<br>

> > > 3) the analysis can be performed 'globally' on call chains<br>

> > > instead<br>

> > > of<br>

> > > just caller-callee pair.<br>

> > > 4) it is not necessarily 'greedy'.<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > > I have a strong problem with global metrics. Things like "only<br>

> > > allow<br>

> > > X% code size growth" mean that whether I inline this callsite can<br>

> > > depend on seemingly unrelated factors like how many other<br>

> > > functions<br>

> > > are in the same module, even outside the call stack at hand.<br>

> > > Similarly for other things like cutoffs about how many inlinings<br>

> > > are<br>

> > > to be performed (now it depends on traversal order, and if you<br>

> > > provide the inliner with a more complete program then it may<br>

> > > chose<br>

> > > to not inline calls it otherwise would have). I don't like spooky<br>

> > > action at a distance, it's hard to predict and hard to debug.<br>

> > ><br>

> > ><br>

> > ><br>

> > > yes, global cutoff is a poor man's method to model 'inlining cost<br>

> > > ><br>

> > > benefit'. However, it does not mean the global inliner can not do<br>

> > > better. Using cutoff is not inherent to the global inliner,<br>

> > > though<br>

> > > the most common approximation.<br>

> ><br>

> > I agree with Nick, having module changes affect inlining of<br>

> > functions<br>

> > in no way related except for the fact that they happen to be in the<br>

> > same module is not acceptable. We must think of a better way. If<br>

> > you<br>

> > have ideas on how we might do this, please elaborate on them. I<br>

> > suspect there is some disconnected subgraph localization that can<br>

> > be<br>

> > applied.<br>

> ><br>

> ><br>

> ><br>

> > It is undoubtedly bad when you get different inlining decisions<br>

> > when<br>

> > you add or remove some unrelated stuff from a module.<br>

><br>

> Good, we're all on the same page then :-) Nevertheless, I consider it<br>

> to be a requirement that this not happen (please keep in mind that<br>

> not all LLVM modules come from C/C++ source files, but are generated<br>

> by all kinds of things). I see no reason why we could not partition<br>

> the call graph into disconnected components and only apply the limit<br>

> per component. Perhaps not a spectacular solution, but it seems<br>

> practical.<br>

><br>

><br>

> > However in<br>

> > reality for a well designed inliner which has other heuristics or<br>

> > filtering based on code analysis, the module limit is actually not<br>

> > likely to be hit before the queue is exhausted (for smaller<br>

> > modules,<br>

> > the growth budget can be larger). The limit is there to prevent<br>

> > extreme cases.<br>

><br>

> It would be good to know how often this limit is actually hit in<br>

> practice. Does it ever happen in SPEC, or the LLVM test-suite or<br>

> during self hosting, etc.?<br>

><br>

><br>

><br>

><br>

><br>

> SPEC is not interesting as it the source won't change.<br>

<br>

</div></div>I don't understand this sentence.<br></blockquote><div><br></div><div><br></div><div>If the problem people worrying about is random performance changes due to unrelated source changes, than SPEC is not relevant.</div>

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class=""><br>

> Please do<br>

> remember that that callsites hitting the global limit are usually<br>

> 'very low' in ranking for inlining, so in theory they should not<br>

> matter much for performance. If it does swing performance badly, you<br>

> end up with a bigger problem to solve --- fix the inline heurisitc<br>

> to hoist the priority for that callsite. Half jokingly, I consider<br>

> this (global limit) a feature (to find opportunities) :).<br>

<br>

</div>I understand your point, and I realize you're half joking (and so I'm half-awkwardly still being serious), but there are much better ways of getting feedback from users than having them complain about random performance variations,</blockquote>

<div><br></div><div>Yes - there might be better ways -- but I won't count on users to be able to report those (missing inlines) though.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

and I won't be laughing after I need to track some of them down. We do need to collect good statistics, but that can be done with some appropriate infrastructure.<br></blockquote><div><br></div><div>I don't have good statistics, but over the last couple of years, I only remember 1 or 2 cases where users reported problems like this -- and the test cases are also micro benchmarks (which are sensitive to basically anything). For large apps, the record is 0.  </div>

<div><br></div><div><br></div><div>thanks,</div><div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Thanks again,<br>

Hal<br>

<div class="HOEnZb"><div class="h5"><br>

><br>

><br>

> David<br>

><br>

><br>

><br>

><br>

><br>

> Thanks again,<br>

> Hal<br>

><br>

><br>

><br>

> ><br>

> ><br>

> > David<br>

> ><br>

> ><br>

> ><br>

> ><br>

> ><br>

> ><br>

> ><br>

> > -Hal<br>

> ><br>

> ><br>

> ><br>

> > ><br>

> > ><br>

> > ><br>

> > > We *do* want more context in the inliner, that's the largest<br>

> > > known<br>

> > > deficiency of our current one. Again, the pass manager rewrite is<br>

> > > taking place to allow the inliner to call into function analysis<br>

> > > passes so that we can have more context available when making our<br>

> > > inlining decision. It's just a long, slow path to getting what we<br>

> > > want.<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > > Algorithms such as a bottom-up inliner<br>

> > ><br>

> > ><br>

> > > analyze a callsite and assign it a value. This could be bottom-up<br>

> > > or<br>

> > > top-down, it doesn't really matter. What matters is that<br>

> > > eventually,<br>

> > > all<br>

> > > (rational) callsites end up in the same sorted datastructure and<br>

> > > are<br>

> > > addressed in order.<br>

> > ><br>

> > > Am I missing something?<br>

> > ><br>

> > > The current inliner doesn't assign values across the whole call<br>

> > > graph<br>

> > > then decide where to inline.<br>

> > ><br>

> > > Firstly, the local decision (looking at a single caller-callee<br>

> > > pair<br>

> > > through a particular call site) works by attempting to determine<br>

> > > how<br>

> > > much of the callee will be live given the values known at the<br>

> > > caller. For instance, we will resolve a switch statement to its<br>

> > > destination block, and potentially eliminate other callees. These<br>

> > > simplifications would still be possible even if we calculated<br>

> > > everything up front.<br>

> > ><br>

> > > Secondly, we iterate with the function passes optimizing the new<br>

> > > function after each inlining is performed. This may eliminate<br>

> > > dead<br>

> > > code (potentially removing call graph edges) and can resolve<br>

> > > loads<br>

> > > (potentially creating new call graph edges as indirect calls are<br>

> > > resolved to direct calls). Handling the CFG updates is one of the<br>

> > > more interesting and difficult parts of the inliner, and it's<br>

> > > very<br>

> > > important for getting C++ virtual calls right. This sort of thing<br>

> > > can't be calculated up front.<br>

> > ><br>

> > > Nick<br>

> > ><br>

> > > PS. You may have guessed that I'm just plain prejudiced against<br>

> > > top-down inliners. I am, and I should call that out before going<br>

> > > too<br>

> > > far down into the discussion.<br>

> > ><br>

> > > In the past I've seem them used for their ability to game<br>

> > > benchmarks<br>

> > > (that's my side of the story, not theirs). You provide an inliner<br>

> > > with tweakable knobs that have really messy complicated<br>

> > > interactions<br>

> > > all across the inliner depending on all sorts of things, then you<br>

> > > select the numbers that happen to give you a 20% speed up on SPEC<br>

> > > for no good reason, and call it success. Attribute the success to<br>

> > > the flexibility provided by the design.<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > > I have seen compiler to add benchmark specific hacks, but I have<br>

> > > also<br>

> > > seen compiler that does excellent job implementing generally<br>

> > > useful<br>

> > > inlining heuristics (cost/benefit functions) based on study of<br>

> > > SPEC<br>

> > > benchmarks and cross validate them on large ISV programs such as<br>

> > > database severs. Think about this: if you can tune the parameter<br>

> > > to<br>

> > > speed up one benchmark 20% without degrading others, even though<br>

> > > the<br>

> > > tuning itself maybe bogus, it proves the fact the global inliner<br>

> > > is<br>

> > > quite flexible and tunable. Pure bottom-up inliner will find a<br>

> > > hard<br>

> > > time doing so.<br>

> > ><br>

> > ><br>

> > > Having said this, getting the global inliner work right may take<br>

> > > years of refinement and tuning to get it right. One thing is that<br>

> > > it<br>

> > > can not rely on the on-the-fly cleanups/scalar ops to get precise<br>

> > > summaries.<br>

> > ><br>

> > ><br>

> > > David<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > > On 6 August 2014 08:54, Nick Lewycky < <a href="mailto:nicholas@mxc.ca">nicholas@mxc.ca</a><br>

> > ><br>

> > > <mailto: <a href="mailto:nicholas@mxc.ca">nicholas@mxc.ca</a> >> wrote:<br>

> > ><br>

> > > Hal Finkel wrote:<br>

> > ><br>

> > > I'd like you to elaborate on your assertion here, however, that<br>

> > > a "topdown inliner is going to work best when you have the whole<br>

> > > program." It seems to me that, whole program or not, a top-down<br>

> > > inlining approach is exactly what you want to avoid the<br>

> > > vector-push_back-cold-path- inlining problem (because, from the<br>

> > ><br>

> > ><br>

> > > caller, you see many calls to push_back, which is small --<br>

> > > because the hot path is small and the cold path has not (yet)<br>

> > > been inlined -- and inlines them all, at which point it can make<br>

> > > a sensible decision about the cold-path calls).<br>

> > ><br>

> > ><br>

> > > I don't see that. You get the same information when looking at a<br>

> > > pair of functions and deciding whether to inline. With the<br>

> > > bottom-up<br>

> > > walk, we analyze the caller and callee in their entirety before<br>

> > > deciding whether to inline. I assume a top-down inliner would do<br>

> > > the<br>

> > > same.<br>

> > ><br>

> > > If you have a top-down traversal and you don't have the whole<br>

> > > program, the first problem you have is a whole ton of starting<br>

> > > points. At first blush bottom up seems to have the same problem,<br>

> > > except that they are generally very straight-forward leaf<br>

> > > functions<br>

> > > -- setters and getters or little loops to test for a property.<br>

> > > Top<br>

> > > down you don't yet know what you've got, and it has lots of calls<br>

> > > that may access arbitrary memory. In either case, you apply your<br>

> > > metric to inline or not. Then you run the function-level passes<br>

> > > to<br>

> > > perform simplification. Bottom up, you're much more likely to get<br>

> > > meaningful simplifications -- your getter/setter melts away. Top<br>

> > > down you may remove some redundant loads or dead stores, but you<br>

> > > still don't know what's going on because you have these opaque<br>

> > > not-yet-analyzed callees in the way. If you couldn't analyze the<br>

> > > memory before, inlining one level away hasn't helped you, and the<br>

> > > function size has grown. You don't get the simplifications until<br>

> > > you<br>

> > > go all the way down the call stack to the setters and getters<br>

> > > etc.<br>

> > ><br>

> > > There's a fix for this, and that's to perform a sort of symbolic<br>

> > > execution and just keep track of what the program has done so far<br>

> > > (ie. what values registers have taken on so far, which pointers<br>

> > > have<br>

> > > escaped etc.), and make each inlining decision in program<br>

> > > execution<br>

> > > order. But that fix doesn't get you very far if you haven't got a<br>

> > > significant chunk of program to work with.<br>

> > ><br>

> > ><br>

> > > Nick<br>

> > > ______________________________ _________________<br>

> > > llvm-commits mailing list<br>

> > > <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a> <mailto: llvm-commits@cs.uiuc. edu ><br>

> > > <a href="http://lists.cs.uiuc.edu/" target="_blank">http://lists.cs.uiuc.edu/</a> mailman/listinfo/llvm-commits<br>

> > > < <a href="http://lists.cs.uiuc.edu/" target="_blank">http://lists.cs.uiuc.edu/</a> mailman/listinfo/llvm-commits ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > > ______________________________ _________________<br>

> > > llvm-commits mailing list<br>

> > > <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

> > > <a href="http://lists.cs.uiuc.edu/" target="_blank">http://lists.cs.uiuc.edu/</a> mailman/listinfo/llvm-commits<br>

> ><br>

> > ><br>

> > ><br>

> > > _______________________________________________<br>

> > > llvm-commits mailing list<br>

> > > <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

> > > <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

> > ><br>

> ><br>

> ><br>

> ><br>

> > --<br>

> > Hal Finkel<br>

> > Assistant Computational Scientist<br>

> > Leadership Computing Facility<br>

> > Argonne National Laboratory<br>

> ><br>

> ><br>

><br>

> --<br>

> Hal Finkel<br>

> Assistant Computational Scientist<br>

> Leadership Computing Facility<br>

> Argonne National Laboratory<br>

><br>

><br>

<br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</div></div></blockquote></div><br></div></div>