[PATCHES] A module inliner pass with a greedy call site queue

Wed Aug 20 01:36:43 PDT 2014

----- Original Message -----
> From: "Chandler Carruth" <chandlerc at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Xinliang David Li" <xinliangli at gmail.com>, "Jiangning Liu" <Jiangning.Liu at arm.com>, "LLVM Commits"
> <llvm-commits at cs.uiuc.edu>
> Sent: Wednesday, August 20, 2014 3:27:28 AM
> Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue
> 
> 
> 
> Sorry that I haven't really had time to dig into this thread in a
> more detailed sense. I will try to do so...
> 
> 
> On Wed, Aug 20, 2014 at 1:10 AM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> I see no reason why we could not partition the call graph into
> disconnected components and only apply the limit per component.
> Perhaps not a spectacular solution, but it seems practical.
> Duncan Sands and I once talked extensively about this. The specific
> idea we had was to use upper bounds on how much code is inlined
> (cumulatively) into a function, and into an SCC. These would be very
> high upper bounds designed specifically to avoid run-away poor
> behavior. I still would like to get back to designing this.
> 
> 
> The challenge of implementing this limit in a good way was a phase
> ordering challenge: we don't know how much code we've inlined until
> we've optimized it *after* inlining it. The result is that until we
> stop inlining and run the rest of the optimization pass pipeline, we
> don't know if we should stop inlining.
> 
> 
> The "obvious" solution to this is to make the inliner (even more)
> iterative with the primary optimization pass pipeline and phrase the
> threshold in a way that naturally adjusts itself (such as the "size"
> of the caller) as the optimizations unravel any abstractions.
> 
> 
> However, that isn't really feasible with the current passes: we would
> have oscillations (which we could fix) and we would spend a
> *spectacular* amount of time re-computing analysis passes despite
> not making further changes to the function -- it would essentially
> require the optimizer to iterate until reaching a fixed point (for
> some likely artificial definition of a fixed point).
> 
> 
> I still think this is the correct design for the inliner and core
> optimizer pipeline, and I'm hoping that we might be able to make
> analysis pass caching and updating (rather than re-running)
> sufficiently aggressive to allow exploring this in the future... but
> there is a lot of infrastructure work to get there.
> 
> 
> Also, neither Duncan nor I were able to find copious examples where
> the lack of such a threshold caused runaway bad inlining decisions
> *and* this had a gross impact on overall performance. The current
> worst impact I've found are massive global constructor functions in
> Chromium which have template-expanded calls to 1000s of small
> functions inlined into them with no performance benefit and
> non-trivial code size growth. But so far, this hasn't even been
> found to be the high order bit in Chromium for either performance or
> code size, so.... :: shrug ::

So the conclusion is that we might not need a limit at all? Sounds good to me.

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory