[LLVMdev] inlining with O3 and O4

Daniel Berlin dberlin at dberlin.org
Tue Aug 28 23:01:37 PDT 2012


On Wed, Aug 29, 2012 at 1:55 AM, Chandler Carruth <chandlerc at google.com> wrote:
> On Tue, Aug 28, 2012 at 10:39 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>
>> On Wed, Aug 29, 2012 at 12:11 AM, Ramanarayanan, Ramshankar
>> <Ramshankar.Ramanarayanan at amd.com> wrote:
>> > I am wondering how O4 vs O3 do inlining. With O4 it looks like inlining
>> > is
>> > done first on each file and then at linking phase. Wouldn’t it be a
>> > better
>> > alternative to delay inlining decisions until the link stage?
>> Yes and no.
>> Yes in the sense that you may make some better decisions.
>> No in the sense that you will end up with larger modules (assuming
>> some simple early CSE/etc is also done), and as a result of having
>> done no inlining, may make worse decisions at the link stage inlining,
>> depending on what IPA analysis you base your link stage inlining on
>> and when it runs.
>>
>> It's certainly possible to have a link-phase only early inliner, and a
>> link-phase only later inliner, and you will, in general, get better
>> decisions than a local inliner + link phase inliner, but the cost you
>> pay is more memory usage, more disk usage, etc.
>
>
> I'm curious -- where do you draw these conclusions from?

Watching 4 compilers (ICC, XLC, GCC, Open64) go through about 10 years
worth of rewriting inliners every few years ;)

>
> With the current LLVM inliner (significant portions of which are quite new)
> I would not expect bad decisions by delaying inlining until link time. In
> fact, there are a large number of heuristics we use during per-module
> inlining which make *zero* sense if you eventually perform LTO.

Sure.  The heuristics tend to become more complex over time, however,
and require more analysis (e.g. "oh, I'm statistically likely to be able
to eliminate large parts of this function because it will become
constant" or "oh, inlining this performance-critical function will
enable us to eliminate loads of otherwise undecidable-aliasing
pointers").  That analysis is usually stymied by a lack of inlining and
simple CSE/dead-code elimination (because, in order to be fast, it's
usually not flow-sensitive and has no concept of whether the code will
ever be executed).

I wouldn't disagree that, with the exact current heuristics I see in
the inliner, you could delay all decisions until later and get better
results.



>
> A very long-standing todo of mine is to build a per-module set of passes for
> LTO builds that is very carefully chosen to be information preserving and
> avoid decisions which can be better made at LTO-time. I suspect that we
> would see significantly better LTO results from this, but of course only an
> experiment will show. My hunch is because the optimization passes in LLVM
> have been heavily tuned for the information available in the per-module
> pass, and many of them will be ineffective if run after. The inliner is a
> good example here. We specifically evaluate potential future inlining
> opportunities when making a particular inlining decision. Doing that
> per-module when you will eventually have total information seems flawed.
You are assuming the analysis you will run to evaluate future inlining
opportunities would not impact what that total information says :)

In a perfect world, you are right.
It's always better to delay decisions to the latest possible point,
when you have all possible information.
In a practical world, the analyses and passes that this "all possible
information" comes from are themselves affected by the decisions you
are now delaying.
