[llvm-commits] PATCH: Major rewrite of inline cost analysis.

Wed Mar 28 02:52:33 PDT 2012

Hello folks!

This is a snapshot of my current work to completely re-write the inline
cost analysis. It's pretty rough, but it's a *lot* of code, and it's
growing rapidly. I wanted to get a patch out there now, get some feedback
on it, and also figure out the strategy for moving this forward (provided
there is interest, but earlier discussions seemed to indicate there was).

First, a touch of background so we're all on the same page. This rewrite
has a few very specific goals that Duncan, I, and some others came up
with in discussions about the inline cost analysis's current problems:
- Re-use as much logic from instsimplify and other optimizations as
possible rather than re-inventing it.
- Maximize the amount of *per-callsite* information available to the
analysis.
- Never count provably dead sections of the inline candidate against the
cost of inlining.
- Achieve these goals while bounding the complexity of the analysis.

The resulting design is actually much simpler than the previous design. For
each callsite, it queries for the inline cost. Using the arguments given to
the callsite, it builds up a mapping of simplifications possible. It then
walks each basic-block in the CFG, traversing in breadth-first order. The
simplifications are used to prune dead successors from the traversal of
basic blocks. For each basic block, we walk the instructions,
simultaneously analyzing their cost and looking for potential
simplifications and constant propagation specific to this callsite.
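
To make the traversal concrete, here is a minimal C++ sketch of the idea.
It is not code from the patch. ToyBlock and liveInstructionCount are
hypothetical names, the "cost" is just an instruction count, and the only
simplification modeled is a branch whose condition is a boolean argument
that the call site passes as a constant.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Hypothetical toy CFG for illustration; these are not LLVM types.
struct ToyBlock {
  int numInstructions = 0;
  // If set, the block ends in a conditional branch on this argument index,
  // and succs is {takenIfTrue, takenIfFalse}.
  std::optional<std::size_t> condArg;
  std::vector<std::size_t> succs;
};

// Walks the CFG breadth-first from block 0 and counts instructions only in
// blocks that are reachable for this particular call site. args[i] holds
// the constant passed for argument i, or nullopt if it is not a constant.
int liveInstructionCount(const std::vector<ToyBlock> &cfg,
                         const std::vector<std::optional<bool>> &args) {
  assert(!cfg.empty());
  std::vector<bool> visited(cfg.size(), false);
  std::deque<std::size_t> worklist;
  worklist.push_back(0);
  visited[0] = true;
  int count = 0;

  while (!worklist.empty()) {
    const std::size_t id = worklist.front();
    worklist.pop_front();
    const ToyBlock &bb = cfg[id];
    count += bb.numInstructions;

    // By default every successor is live.
    std::vector<std::size_t> live = bb.succs;
    // If the branch condition is a known constant at this call site,
    // only one successor is live; the other is pruned from the walk.
    if (bb.condArg.has_value() && bb.succs.size() == 2 &&
        *bb.condArg < args.size() && args[*bb.condArg].has_value()) {
      const bool cond = *args[*bb.condArg];
      live.assign(1, cond ? bb.succs[0] : bb.succs[1]);
    }

    for (std::size_t s : live) {
      if (!visited[s]) {
        visited[s] = true;
        worklist.push_back(s);
      }
    }
  }
  return count;
}
```

With a large block guarded by a branch on an argument, a call site that
passes a constant selecting the other side never has that block counted
against it, while a call site passing an unknown value pays for both
sides.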

As soon as the threshold for inlining is exceeded, the entire analysis
aborts. Also, as soon as any instruction which is considered a
'never-inline' instruction is encountered, we abort the entire analysis.
By only analyzing up to the threshold, we ensure the analysis is
relatively fast and only considers small sections of code. The only time
the analysis will become slow is when inlining enables massive
simplifications, so the cost of the analysis is strongly tied to its
benefits.
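
The two abort conditions can be sketched as follows. Again, this is an
illustration under assumptions and not code from the patch: ToyInst,
ToyResult and boundedCost are hypothetical names, and per-instruction
costs are plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; these are not LLVM types.
struct ToyInst {
  int cost = 0;
  bool neverInline = false;
};

struct ToyResult {
  bool inlinable = false;  // true only if the whole walk finished
  int cost = 0;            // cost accumulated before stopping
  std::size_t visited = 0; // number of instructions actually examined
};

// Accumulates cost and stops at the first never-inline instruction or as
// soon as the running cost exceeds the threshold, so the work done is
// bounded by the threshold rather than by the size of the callee.
ToyResult boundedCost(const std::vector<ToyInst> &insts, int threshold) {
  assert(threshold >= 0);
  ToyResult r;
  for (const ToyInst &inst : insts) {
    ++r.visited;
    if (inst.neverInline)
      return r; // abort: this callee must never be inlined
    r.cost += inst.cost;
    if (r.cost > threshold)
      return r; // abort: already too expensive, stop looking
  }
  r.inlinable = true;
  return r;
}
```

The visited field is only there to make the bound observable: for a huge
callee and a small threshold, the number of instructions examined depends
on the threshold and not on the callee's size.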

This patch implements these core elements of the design above. There are
some problems with the patch as it stands that I fully plan to fix and have
good ideas about how to fix. There is also some significant work that still
needs to be done to really maximize the advantage of this approach. More on
those below. It bootstraps, passes the nightly test suite, etc. I don't
have performance numbers from the nightly test suite (yet? trying to get
them, but proving tricky). The 'clang' binary when bootstrapped gets
somewhat larger (4%) due to a clear bug, and it's a bug which mostly
impacts cold functions in LLVM & Clang. It also gets 2% faster for -O0
compile-times of big C++ inputs, so we're already seeing benefits here.
Fixing this bug is the first of the "big areas where more work is needed".

Big question for me: What is the best strategy for making progress? I don't
want to grow this patch more out-of-tree. I can see a few options, in my
rough order of preference:
1) Commit this, possibly after some of the refactorings I mention below.
Deal with and track the fallout, regressions, etc. Iterate on it rapidly
in-tree.
2) Add some of the additional functionality to address likely regressions,
and deal with a still bigger patch in a week.
3) Refactor this patch and the inliner to have two completely different
inline systems and inline cost systems, and introduce a flag to select
between them. Check in the new code alongside the old, and run experiments
in both until we're ready to have a flag day.

#1 requires the most help from other members of the community. I'll need
help getting reproductions of the regression, bitcode for inputs which
misbehave, analysis help in all likelihood, etc. However, it is also likely
the fastest way to make progress and avoids some make-work the others
require.

#2 is a bit of a compromise, but I worry it will make the review endless.

#3 is the most conservative approach, but it requires a hell of a lot of
work to set up an intermediate state that will then go away. =/ Not ideal,
but do-able if necessary.

Some problems with the patch:
- I need to add some basic regression tests, and fix a few minor
regressions surrounding dynamic-allocas, etc.
- The 'InlineCost' interface is a mess. I'm not happy. I have lots of ideas
about how to improve it, but they require serious surgery to the inliner in
addition to the inline cost analysis, so I wanted to work on them
incrementally and separately from the cost analysis logic.
- The 'InlineCostAnalysis' interface is more of a mess. Same problem as
above. I think these two interfaces are best simplified when working on the
user code, rather than their underlying implementation.
- I probably need more stats, and to collect them using the STATISTIC
stuff. I don't know the LLVM stats infrastructure well, so suggestions on
the best way to collect that would be welcome. What I have was useful for
debugging, but probably isn't right long-term.
- Currently, this will regress opt compile times in many cases. I have many
ideas about how to improve this. I've measured a 2% regression with the
current patch in my "big C++ input" test case of all of lib/Lex/*.cpp so it
isn't disastrous currently, but I wouldn't be surprised to see pretty wild
variance. Unfortunately, most of my ideas for fixing this involve changing
how the inliner asks for the cost so that it asks less frequently. Do you
detect a theme? ;]

Some big areas where more work is needed. I have initial ideas here, but
less concrete plans:
- We need a strategy for bounding the growth of functions which call many,
many small functions. I've hashed out a decent one with Duncan, but I won't
know if it works until I implement it, and I'd like to get the above done
first, and I'd like for stuff to be in-tree first.
- We need to start eagerly considering the implications of recursive
inlining. This is only possible with the new framework, and has powerful
implications like *perfect* wrapper-function removal, and recursive
specialization during inlining.
- We need to track constants passed into functions as 'references', where
they are stored to an alloca *immediately* before calling the function, and
loaded back out of it immediately within the function.
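
As an illustration of the last item, here is the kind of C++ source that
produces the pattern. The functions pick and caller are made up for this
example. Before any cleanup, a front end typically lowers the call by
materializing a temporary for the literal (an alloca), storing the
constant to it, and passing its address; the callee then loads the value
straight back out.

```cpp
// Hypothetical example; neither function comes from LLVM or the patch.

// 'mode' arrives as a pointer at the IR level, so the callee sees a load
// from its argument rather than a constant.
int pick(const int &mode, int x) {
  if (mode == 0)
    return x;          // cheap path
  return x * x + mode; // more expensive path
}

int caller(int x) {
  // The literal 0 is stored to a temporary immediately before the call.
  // An analysis that tracks this store could treat 'mode' as the constant
  // 0 for this call site and consider only the cheap path.
  return pick(0, x);
}
```

Seeing through the store and load would let the per-callsite analysis
treat this the same way as a constant passed by value.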

I'm sure there are other big areas where we can improve the inlining once
we have a more rational cost model, and a cost model that can more easily
be adapted to highly localized circumstances rather than a heuristic model
that has to make wild guesses based on the function's definition without
insight into its uses.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: inline-cost-rewrite.diff
Type: text/x-patch
Size: 86444 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20120328/2d14b004/attachment.bin>