[PATCHES] A module inliner pass with a greedy call site queue

Yin Ma yinma at codeaurora.org
Wed Jul 30 15:46:33 PDT 2014


Hello,

Thank you all for replying to my post. Let me give some background first.
The motivation for this work came from analyzing performance degradation of
LLVM relative to GCC 4.9. We found that the current SCC inliner cannot
inline a critical function in one of the SPEC2000 benchmarks unless the
inline threshold is raised to a very large number. Consider A calls B calls
C: the SCC inliner starts with B -> C and inlines C into B, but the real
performance gain is from inlining B into A, because B is called in a loop
inside A. After C is inlined into B, B becomes so large that it can no
longer be inlined into A under the default inline threshold. Raising the
threshold far enough to catch this case dramatically increases code size
and degrades performance elsewhere. We would like to consider B -> A before
C -> B, which is why we started this work.
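
To make the shape of the problem concrete, here is an illustrative C++
sketch (the names A/B/C and the bodies are mine, not from the benchmark):

  // Illustrative only: the A-B-C chain described above.
  int C(int x) { return x * 2 + 1; }   // mid-sized callee

  int B(int x) {
    // After C is inlined here (twice), B grows past the default
    // inline threshold and can no longer be inlined into A.
    return C(x) + C(x + 1);
  }

  int A(int n) {
    int Sum = 0;
    for (int i = 0; i < n; ++i)
      Sum += B(i);   // the profitable inlining is B into this hot loop
    return Sum;
  }

Inlining B -> A first would put B's body inside the hot loop while B is
still small; the fixed bottom-up SCC order forecloses that.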

The flow of this inliner is as follows (a minimal code sketch follows the
list):
1. Collect every known call site in the module.
2. Weight them.
3. Try to inline the call site with the best weight.
4. Re-sort the queue.
5. Keep inlining until a threshold is hit.
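
Here is a minimal, self-contained sketch of that loop in C++; the weight
model and the helper are placeholders of my own, not the patch's API (in
the real pass, the local decision and the inlining itself are delegated to
the SCC inliner):

  #include <queue>
  #include <vector>

  struct CallSiteInfo {
    int Id;          // stand-in for a real call site handle
    double Weight;   // higher means more profitable to inline
    bool operator<(const CallSiteInfo &R) const { return Weight < R.Weight; }
  };

  // Placeholder local decision; the real pass calls into the SCC inliner.
  static bool tryInline(const CallSiteInfo &CS) { return CS.Weight > 0; }

  void greedyInline(std::vector<CallSiteInfo> Sites, int Budget) {
    // Steps 1-2: all known call sites, already weighted, in one queue.
    std::priority_queue<CallSiteInfo> Queue(Sites.begin(), Sites.end());
    while (!Queue.empty() && Budget > 0) {  // step 5: stop at a threshold
      CallSiteInfo Best = Queue.top();      // step 3: best weight first
      Queue.pop();
      if (tryInline(Best))
        --Budget;
      // Step 4: call sites exposed by the inline would be re-weighted
      // and pushed here, keeping the queue sorted.
    }
  }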
I implemented it on my own, based on my understanding of how inlining
should work for our use cases. A weighted global queue is a common approach
to making inlining decisions; GCC, for example, uses a global queue with
weights. The novelty here is the integration with the current SCC inliner:
it makes essentially no changes to the SCC inliner, and it calls into the
SCC inliner to do the local decision making and the actual inlining work.

This greedy inliner has passed all our internal benchmarks and test code,
including SPEC2000/2006 and user code such as a very large C++ program. The
tuning, however, is only preliminary and has not been thoroughly
investigated. I evaluated this inliner on several of our use cases,
including SPEC2000 for ARM/AArch64. I hope the community will participate
in fine-tuning the framework and the heuristic.

Inlining is a heuristic problem. The SCC inliner is an SCC pass that
follows the fixed bottom-up order of the SCCs; the greedy inliner is a
module pass that uses a global queue to change that order. A queue that
considers all call sites provides very good flexibility to prioritize the
order in which call sites are inlined to fit particular situations. The
difficult part is finding the right equation to compute the weight, and it
can be further improved. We have already observed many scenarios where this
inliner handles the inlining properly, such as the A-B-C problem mentioned
above.

In general, I believe this inliner could be a good second inliner in
parallel with the current SCC inliner. Its framework is different enough
that it can cover situations where the SCC inliner is weak. Its flexibility
(a whole-program queue with priority-based selection and re-sorting) makes
it possible to inline in top-down order, to prioritize call sites in loop
nests, and also to inline bottom-up, all by choosing a different weight
computation equation. Beyond a generally good weight equation, people can
even add their own tricks or handling code for their special needs, such as
a target-dependent bonus.
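
As a hedged illustration, a weight equation along these lines might look
like the following; the factors (callee size, loop depth, a
target-dependent bonus) come from the discussion above, but the exact
formula and constants are made up:

  // Sketch only: not the equation in the patch.
  double computeWeight(unsigned CalleeSize, unsigned LoopDepth,
                       double TargetBonus) {
    const double DepthBonus = 8.0;   // favor call sites in deep loop nests
    const double SizePenalty = 1.0;  // discourage inlining large callees
    return DepthBonus * LoopDepth - SizePenalty * CalleeSize + TargetBonus;
  }

Swapping in a different equation (for example, one that penalizes call
depth instead of rewarding loop depth) changes the traversal order without
touching the rest of the framework.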

The weight equation in this patch was tuned on several of our use cases.
The tuning is preliminary and there is room for further investigation. We
did evaluate this inliner against GCC. For example, on a very large C++
program with a lot of if-else and calls, where code size matters and -Os is
used everywhere, the new inliner performs closer to GCC and better than the
LLVM SCC inliner. Because the SCC inliner works bottom-up, it inlines the
leaf call sites first, and those may be the least likely to be called. The
module inliner considers call sites from outermost to innermost, which
leads to better performance. For our sampled program, the module inliner
inlined more functions yet produced smaller code than the SCC inliner. GCC
still does a better job on this program, thanks to partial inlining and IPA
function cloning, which LLVM currently lacks.

Under -O3, on the benchmarks we tested such as SPEC2000/2006, it performs
on par with the current GCC and LLVM inliners in general. Because the
weight computation takes loop nest depth and other factors into account,
eon and mesa are faster.

In the pass order, this module-based inliner pass replaces the original SCC
inliner pass. This has some impact on the ordering of passes in the pass
manager, mainly for SCC-based passes, but in my opinion it should not
affect C++ virtual dispatch much. For LTO, it can consider all global call
sites, and it can apply different thresholds at bitcode generation time and
during the LTO processing phase once the corresponding handlers are added.
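
For reference, a hedged sketch of how the pass could be selected: the
-greedy-inliner flag is the one from my original post,
createFunctionInliningPass() is the existing SCC inliner entry point, and
createGreedyInlinerPass() is a hypothetical entry point assumed to be
provided by the patch:

  #include "llvm/Support/CommandLine.h"
  #include "llvm/Transforms/IPO.h"
  using namespace llvm;

  static cl::opt<bool> GreedyInliner(
      "greedy-inliner", cl::init(false),
      cl::desc("Use the greedy module inliner instead of the SCC inliner"));

  Pass *createGreedyInlinerPass();  // hypothetical, provided by the patch

  Pass *createInlinerOfChoice() {
    return GreedyInliner ? createGreedyInlinerPass()     // module pass
                         : createFunctionInliningPass(); // SCC inliner
  }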

Thanks,

Yin

-----Original Message-----
From: James Molloy [mailto:james.molloy at arm.com] 
Sent: Tuesday, July 29, 2014 2:20 AM
To: 'Nick Lewycky'; Yin Ma
Cc: Jiangning Liu; llvm-commits at cs.uiuc.edu; Chandler Carruth
Subject: RE: [PATCHES] A module inliner pass with a greedy call site queue

Hi Yin,

This is certainly interesting, and has potential. I'm certain Chandler will
want to weigh in, as inlining is well known to be his baby.

My major concern is how thoroughly this has been tested and what its effects
are. 

  * What codebases have you tested this on?
  * Most importantly, how did you evaluate its performance compared to GCC's
inlining algorithm and the current inliner?
  * How did you arrive at the heuristic numbers/threshold values/bonus
values you did? Was this arbitrary, hand tuned or the result of an automated
search?
  * What is this based on? Is it an algorithm you've made up yourself, or
does it have its roots in a paper somewhere?

I like that this algorithm is taking into account important factors such as
loop nest depth - I think we've been missing this for a while.

Cheers,

James

-----Original Message-----
From: llvm-commits-bounces at cs.uiuc.edu
[mailto:llvm-commits-bounces at cs.uiuc.edu] On Behalf Of Nick Lewycky
Sent: 29 July 2014 08:49
To: Yin Ma
Cc: Jiangning Liu; llvm-commits at cs.uiuc.edu
Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue

Yin Ma wrote:
> Hello,
>
> This patch is an implementation of a module inliner pass with a greedy
> call site queue. This greedy inliner reuses the existing SCC inliner to
> do the local decision and inlining work. It can improve AArch64 SPEC2000
> eon by 16% and mesa by 5% compared with LLVM's default inliner on a real
> Cortex-A53 device. (-O3 -mllvm -inline-perf-mode=true -mllvm
> -greedy-inliner=true)

A few points. A top-down inliner is going to work best when you have the
whole program, like when doing LTO or something like gcc singlesource.

What I *really* want to see is a major change to the way we do optimizations
when we think we have the whole program (for shorthand I say "in LTO" but
that doesn't need to be the same thing). We should have a top-down CFG walk
first which does optimizations structured like symbolic execution and works
very hard to prune reachable functions, preventing us from ever loading them
out of the .bc file. Then we should do our usual bottom-up optimization.

> Compared with the SCC inliner, which uses a fixed bottom-up order, the
> greedy inliner utilizes a global call site queue with a greedy weight
> computation algorithm to provide more flexibility in the call site
> decision.

Heh, I recognize this design. :) A "total amount of inlining done in the
program" metric.

> It can be implemented as top-down order or any other order you like to
> do the inlining work. And the speed of the greedy inliner is almost the
> same as the SCC inliner's. Because of the different order setup, this
> inliner could be an alternative solution to improve performance or
> reduce code size. In our experiments, this greedy inliner also did a
> better job in -Os mode than the default LLVM inliner.

Sure, but *why*? Inlining is a famously fickle problem, and it's critical to
get it right. We know we get it wrong, and that leads to problems where we
inline too little, and also problems where we inline too much. What does
your inliner do to bzip2? or snappy? Inlining the slow path of a vector
push_back is a huge performance problem.

Our inliner is known bad at the moment. It's better than it used to be, but
in order to make it properly better, we need to make it use other llvm
function analyses, which SCC passes can't do. That's what's motivating
Chandler's pass manager work.

> Please give a review.

I think the first thing we need is to understand why this is the right
approach to inlining. Explain further how it decides what to inline, how it
affects different languages (are you aware of the current inliner's SCC
refinement trick? and how that impacts C++ virtual dispatch in particular?),
how it works on different CPUs, how it affects compile times, how it affects
generated code sizes, etc.

Nick
_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits