[PATCHES] A module inliner pass with a greedy call site queue

Hal Finkel hfinkel at anl.gov
Tue Jul 29 20:20:11 PDT 2014


----- Original Message -----
> From: "Nick Lewycky" <nicholas at mxc.ca>
> To: "Yin Ma" <yinma at codeaurora.org>
> Cc: "Jiangning Liu" <Jiangning.Liu at arm.com>, llvm-commits at cs.uiuc.edu
> Sent: Tuesday, July 29, 2014 2:48:36 AM
> Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue
> 
> Yin Ma wrote:
> > Hello,
> >
> > This patch is an implementation of a module inliner pass with a
> > greedy call site queue. This greedy inliner reuses the existing SCC
> > inliner to make the local decisions and do the inlining work. It
> > can improve AArch64 SPEC2000 eon by 16% and mesa by 5% over LLVM's
> > default inliner on a real Cortex-A53 device. (-O3 -mllvm
> > -inline-perf-mode=true -mllvm -greedy-inliner=true)
> 
> A few points. A topdown inliner is going to work best when you have
> the whole program, like when doing LTO or something like gcc
> singlesource.

Nick,

I would really like to see Yin's response to the question you asked at the end of your reply (why is this the right approach to inlining). I'd also like you to elaborate on your assertion here that a "topdown inliner is going to work best when you have the whole program." It seems to me that, whole program or not, a top-down inlining approach is exactly what you want in order to avoid the vector-push_back-cold-path-inlining problem: from the caller, you see many calls to push_back, which is small (because the hot path is small and the cold path has not (yet) been inlined), so you inline them all, at which point you can make a sensible decision about the cold-path calls.
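For concreteness, the push_back pattern in question looks roughly like the sketch below (a simplified illustration, not libc++'s actual implementation): the hot path is a compare and a store, while the cold grow path is the part that should stay out of line.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Simplified vector-like container. The hot path of push_back is tiny;
// the cold path (grow) is comparatively large and should generally not
// be inlined into every call site.
struct IntVec {
  int *data = nullptr;
  size_t size = 0, capacity = 0;

  // Cold path: reallocation. Inlining this body into every push_back
  // call site would bloat the caller for a rarely-taken branch.
  __attribute__((noinline)) void grow() {
    capacity = capacity ? capacity * 2 : 4;
    int *bigger = static_cast<int *>(std::malloc(capacity * sizeof(int)));
    if (data) {
      std::memcpy(bigger, data, size * sizeof(int));
      std::free(data);
    }
    data = bigger;
  }

  // Hot path: one compare, one store. Once the cold call is kept out
  // of line, this whole function is cheap to inline everywhere.
  void push_back(int v) {
    if (size == capacity) // almost always false
      grow();
    data[size++] = v;
  }
};
```

An inliner that sees push_back before grow has been kept out of line may judge it too big to inline; seen from the caller after the cold path is deferred, it is obviously profitable.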

Most top-down approaches that I've seen fail because they reach a cut-off limit, which inevitably ends up being too small, and thus they fail to inline some small leaf(-like) functions. That obviously needs to be avoided.
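For reference, the "greedy call site queue" design under discussion can be sketched roughly as follows. This is a hypothetical illustration of the idea only, not Yin's actual patch: the weight function, struct fields, and names are all made up for the example.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// One inlining candidate: a call from caller into callee.
struct CallSite {
  std::string caller, callee;
  int calleeSize;    // rough instruction count of the callee
  int callFrequency; // estimated execution count of the call
};

// Hypothetical weight: favor hot calls to small callees. A real
// implementation would fold in many more heuristics.
static int weight(const CallSite &CS) {
  return CS.callFrequency - CS.calleeSize;
}

struct ByWeight {
  bool operator()(const CallSite &A, const CallSite &B) const {
    return weight(A) < weight(B); // max-heap: best candidate on top
  }
};

// Drain the queue greedily, visiting the best candidate first
// regardless of SCC order (here we just record the visit order).
// Newly exposed call sites from an inlined body would be pushed
// back onto the queue.
std::vector<std::string> greedyInline(std::vector<CallSite> sites) {
  std::priority_queue<CallSite, std::vector<CallSite>, ByWeight> Q(
      ByWeight{}, std::move(sites));
  std::vector<std::string> order;
  while (!Q.empty()) {
    CallSite CS = Q.top();
    Q.pop();
    order.push_back(CS.caller + "->" + CS.callee);
  }
  return order;
}
```

The interesting policy questions Nick raises below are all hidden in the weight function and in the decision of when to stop draining the queue.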

 -Hal

> 
> What I *really* want to see is a major change to the way we do
> optimizations when we think we have the whole program (for shorthand
> I say "in LTO" but that doesn't need to be the same thing). We
> should have a top-down CFG walk first which does optimizations
> structured like symbolic execution and works very hard to prune
> reachable functions, preventing us from ever loading them out of the
> .bc file. Then we should do our usual bottom-up optimization.
> 
> > Compared with the SCC inliner, which is bottom-up and fixed-order,
> > the greedy inliner utilizes a global call site queue with a greedy
> > weight computation algorithm to provide more flexibility in the
> > call site decisions.
> 
> Heh, I recognize this design. :) A "total amount of inlining done in
> the
> program" metric.
> 
> > It can be implemented in top-down order or any other order you
> > like to do the inlining work. And the speed of the greedy inliner
> > is almost the same as the SCC inliner's. Because of the different
> > order setup, this inliner could be an alternative solution to
> > improve performance or reduce code size. In our experiments, this
> > greedy inliner also did a better job in -Os mode than the default
> > LLVM inliner.
> 
> Sure, but *why*? Inlining is a famously fickle problem, and it's
> critical to get it right. We know we get it wrong, and that leads to
> problems where we inline too little, and also problems where we
> inline too much. What does your inliner do to bzip2? or snappy?
> Inlining the slow path of a vector push_back is a huge performance
> problem.
> 
> Our inliner is known bad at the moment. It's better than it used to
> be, but in order to make it properly better, we need to make it use
> other llvm function analyses, which SCC passes can't do. That's
> what's motivating Chandler's pass manager work.
> 
> > Please give a review.
> 
> I think the first thing we need is to understand why this is the
> right approach to inlining. Explain further how it decides what to
> inline, how it affects different languages (are you aware of the
> current inliner's SCC refinement trick? and how that impacts C++
> virtual dispatch in particular?), how it works on different CPUs,
> how it affects compile times, how it affects generated code sizes,
> etc.
> 
> Nick
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



