[PATCHES] A module inliner pass with a greedy call site queue

Chandler Carruth chandlerc at google.com
Wed Aug 20 01:27:28 PDT 2014


Sorry that I haven't really had time to dig into this thread in a more
detailed sense. I will try to do so...

On Wed, Aug 20, 2014 at 1:10 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> I see no reason why we could not partition the call graph into
> disconnected components and only apply the limit per component. Perhaps not
> a spectacular solution, but it seems practical.


Duncan Sands and I once talked extensively about this. The specific idea we
had was to use upper bounds on how much code is inlined (cumulatively) into
a function, and into an SCC. These would be very high upper bounds designed
specifically to avoid run-away poor behavior. I still would like to get
back to designing this.
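To make the idea concrete, here is a minimal sketch (hypothetical names and cap values, nothing like LLVM's actual inliner code) of what cumulative upper bounds per function and per SCC might look like: track how much code has been inlined into each caller and into each SCC, and refuse further inlining once a deliberately high ceiling is crossed.

```python
# Hypothetical sketch of cumulative inlining caps per function and per SCC.
# The class, thresholds, and scc_of mapping are illustrative assumptions,
# not LLVM's actual implementation.

FUNCTION_CAP = 10_000   # very high ceilings: meant only to stop runaway growth
SCC_CAP = 50_000

class InlineBudget:
    def __init__(self, scc_of):
        self.scc_of = scc_of              # maps function name -> SCC id
        self.inlined_into_fn = {}         # cumulative instructions inlined per function
        self.inlined_into_scc = {}        # cumulative instructions inlined per SCC

    def can_inline(self, caller, callee_size):
        fn_total = self.inlined_into_fn.get(caller, 0) + callee_size
        scc = self.scc_of[caller]
        scc_total = self.inlined_into_scc.get(scc, 0) + callee_size
        return fn_total <= FUNCTION_CAP and scc_total <= SCC_CAP

    def record_inline(self, caller, callee_size):
        self.inlined_into_fn[caller] = self.inlined_into_fn.get(caller, 0) + callee_size
        scc = self.scc_of[caller]
        self.inlined_into_scc[scc] = self.inlined_into_scc.get(scc, 0) + callee_size
```

The per-SCC cap matters because mutually recursive functions can each stay under a per-function cap while the SCC as a whole still grows without bound.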

The challenge of implementing this limit well was really one of phase
ordering: we don't know how much code we've inlined until we've optimized it
*after* inlining it. The result is that until we stop inlining and run the
rest of the optimization pass pipeline, we don't know whether we should have
stopped inlining.

The "obvious" solution to this is to make the inliner (even more) iterative
with the primary optimization pass pipeline, and to phrase the threshold in
terms that naturally adjust themselves (such as the current "size" of the
caller) as the optimizations unravel abstractions.
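A toy sketch of that shape of iteration (entirely hypothetical; the budget formula, the `simplify` callback, and the sizes are invented for illustration): inlining and simplification alternate, and the effective budget shrinks as the caller grows, so the process self-limits.

```python
# Hypothetical sketch of an inliner interleaved with simplification, where the
# effective threshold is phrased relative to the caller's *current* size.
# Nothing here models LLVM's actual cost model.

def iterative_inline(caller_size, call_sites, simplify, base_threshold=300):
    """call_sites: list of callee sizes; simplify(size) -> reduced size,
    standing in for running the rest of the optimization pipeline."""
    changed = True
    while changed:
        changed = False
        for cs in list(call_sites):
            # Budget relative to the caller's current size: a caller that has
            # already grown large gets a smaller effective budget, and a caller
            # that simplification shrank gets some budget back.
            effective = max(0, base_threshold - caller_size // 10)
            if cs <= effective:
                call_sites.remove(cs)
                caller_size = simplify(caller_size + cs)
                changed = True
    return caller_size
```

Note that this toy loop terminates only because the call-site list shrinks monotonically; a real pipeline phrased this way has to worry about the oscillations and re-analysis costs described below.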

However, that isn't really feasible with the current passes: we would have
oscillations (which we could fix) and we would spend a *spectacular* amount
of time re-computing analysis passes despite not making further changes to
the function -- it would essentially require the optimizer to iterate until
reaching a fixed point (for some likely artificial definition of a fixed
point).

I still think this is the correct design for the inliner and core optimizer
pipeline, and I'm hoping that we might be able to make analysis pass
caching and updating (rather than re-running) sufficiently aggressive to
allow exploring this in the future... but there is a lot of infrastructure
work to get there.
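The caching-and-updating idea can be sketched as a result cache keyed by a per-function version counter (a deliberately simplified assumption; LLVM's actual analysis management is far more involved): a result is reused as long as the function it was computed for has not changed, and recomputed only on actual modification.

```python
# Hypothetical sketch of caching analysis results and invalidating them only
# when the underlying function actually changes, instead of recomputing after
# every pipeline iteration. Names are illustrative, not LLVM's API.

class AnalysisCache:
    def __init__(self):
        self.results = {}   # (fn_name, analysis_name) -> (fn_version, result)

    def get(self, fn_name, fn_version, analysis_name, compute):
        key = (fn_name, analysis_name)
        cached = self.results.get(key)
        if cached is not None and cached[0] == fn_version:
            return cached[1]            # function unchanged: reuse the result
        result = compute()              # changed (or first query): recompute
        self.results[key] = (fn_version, result)
        return result
```

The payoff is exactly the scenario above: an iterating pipeline that reaches a fixed point stops invalidating anything, so later iterations hit the cache instead of re-running every analysis.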

Also, neither Duncan nor I was able to find many examples where the lack of
such a threshold caused runaway bad inlining decisions *and* this had a
gross impact on overall performance. The worst impact I've found so far is
massive global constructor functions in Chromium which have
template-expanded calls to 1000s of small functions inlined into them with
no performance benefit and non-trivial code-size growth. But so far, this
hasn't even been found to be the high-order bit in Chromium for either
performance or code size, so.... :: shrug ::