[PATCHES] A module inliner pass with a greedy call site queue

Nick Lewycky nicholas at mxc.ca
Tue Jul 29 00:48:36 PDT 2014


Yin Ma wrote:
> Hello,
>
> This patch is an implementation of a module inliner pass with a greedy
> call site queue. This greedy inliner reuses the existing SCC inliner to
> make the local decisions and do the inlining work. It can improve
> AArch64 SPEC2000 eon by 16% and mesa by 5% compared with LLVM's default
> inliner on a real Cortex-A53 device. (-O3 -mllvm -inline-perf-mode=true
> -mllvm -greedy-inliner=true)

A few points. A top-down inliner is going to work best when you have the 
whole program, like when doing LTO or something like gcc singlesource.

What I *really* want to see is a major change to the way we do 
optimizations when we think we have the whole program (for shorthand I 
say "in LTO", but that doesn't need to be the same thing). We should 
have a top-down CFG walk first which does optimizations structured like 
symbolic execution and works very hard to prune the set of reachable 
functions, so that we never load the unreachable ones out of the .bc 
file at all. Then we should do our usual bottom-up optimization.
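
To make that concrete, here is a rough sketch of the kind of walk I mean 
(toy code over a made-up call graph type, not the real LLVM API):

  #include <map>
  #include <queue>
  #include <set>
  #include <string>
  #include <vector>

  // Toy whole-program view: function name -> callees. In a real
  // implementation the bodies would stay in the .bc file until
  // materialized.
  using ToyCallGraph = std::map<std::string, std::vector<std::string>>;

  // Top-down walk from the entry point. Only functions proven reachable
  // are ever visited; a real pass would symbolically evaluate each body
  // here, folding branches to drop call edges before queueing them.
  std::set<std::string> reachableFrom(const ToyCallGraph &CG,
                                      const std::string &Entry) {
    std::set<std::string> Reachable;
    std::queue<std::string> Worklist;
    Worklist.push(Entry);
    while (!Worklist.empty()) {
      std::string F = Worklist.front();
      Worklist.pop();
      if (!Reachable.insert(F).second)
        continue; // already visited
      auto It = CG.find(F);
      if (It == CG.end())
        continue; // body not in this module (external function)
      for (const std::string &Callee : It->second)
        Worklist.push(Callee);
    }
    return Reachable; // the usual bottom-up pipeline runs over this set
  }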

> Compared with the SCC inliner, which is bottom-up in a fixed order, the
> greedy inliner uses a global call site queue with a greedy weight
> computation algorithm to provide more flexibility in the call site
> decisions.

Heh, I recognize this design. :) A "total amount of inlining done in the 
program" metric.

> It can be implemented in top-down order or any other order you like to
> do the inlining work in, and the speed of the greedy inliner is almost
> the same as the SCC inliner's. Because of the different order setup,
> this inliner could be an alternative way to improve performance or
> reduce code size. In our experiments, this greedy inliner also did a
> better job in -Os mode than the default LLVM inliner.

Sure, but *why*? Inlining is a famously fickle problem, and it's 
critical to get it right. We know we get it wrong, and that leads both 
to problems where we inline too little and to problems where we inline 
too much. What does your inliner do to bzip2? Or snappy? Inlining the 
slow path of a vector push_back is a huge performance problem.
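
To spell out the push_back case (a hand-rolled stand-in for 
std::vector, so the hot and cold paths are explicit):

  #include <cstdlib>
  #include <cstring>

  struct IntVec {
    int *Data = nullptr;
    unsigned Size = 0, Cap = 0;

    // Cold path: reallocation. If the inliner pulls this into every
    // push_back call site, the hot loop bloats with code that almost
    // never runs, hurting icache behavior and register allocation.
    __attribute__((noinline)) void grow() {
      unsigned NewCap = Cap ? Cap * 2 : 8;
      int *NewData = static_cast<int *>(std::malloc(NewCap * sizeof(int)));
      if (Data) {
        std::memcpy(NewData, Data, Size * sizeof(int));
        std::free(Data);
      }
      Data = NewData;
      Cap = NewCap;
    }

    // Hot path: cheap and worth inlining, as long as the grow() call
    // inside it stays outlined.
    void push_back(int V) {
      if (Size == Cap)
        grow();
      Data[Size++] = V;
    }
  };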

Our inliner is known to be bad at the moment. It's better than it used 
to be, but in order to make it properly better, we need to make it use 
other LLVM function analyses, which SCC passes can't do. That's what's 
motivating Chandler's pass manager work.

> Please give a review.

I think the first thing we need is to understand why this is the right 
approach to inlining. Explain further how it decides what to inline, how 
it affects different languages (are you aware of the current inliner's 
SCC refinement trick, and how that impacts C++ virtual dispatch in 
particular?), how it works on different CPUs, how it affects compile 
times, how it affects generated code sizes, etc.
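
For anyone not familiar with that interaction, the C++ case looks 
roughly like this; the SCC refinement is what lets the inliner revisit 
the now-direct call in the same pass:

  struct Shape {
    virtual int area() const = 0;
    virtual ~Shape() = default;
  };

  struct Square : Shape {
    int Side;
    explicit Square(int S) : Side(S) {}
    int area() const override { return Side * Side; }
  };

  int f() {
    Shape *S = new Square(4); // once the constructor is inlined, the
    int R = S->area();        // dynamic type is known here, the virtual
    delete S;                 // call devirtualizes to Square::area(),
    return R;                 // and that direct call can be inlined too
  }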

Nick


