[PATCHES] A module inliner pass with a greedy call site queue

Tue Aug 19 15:40:54 PDT 2014

----- Original Message -----
> From: "Nick Lewycky" <nicholas at mxc.ca>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Jiangning Liu" <Jiangning.Liu at arm.com>, llvm-commits at cs.uiuc.edu, "Yin Ma" <yinma at codeaurora.org>
> Sent: Wednesday, August 6, 2014 2:54:17 AM
> Subject: Re: [PATCHES] A module inliner pass with a greedy call site queue
> 
> Hal Finkel wrote:
> > I'd like you to elaborate on your assertion here, however, that a
> > "topdown inliner is going to work best when you have the whole
> > program." It seems to me that, whole program or not, a top-down
> > inlining approach is exactly what you want to avoid the
> > vector-push_back-cold-path-inlining problem (because, from the
> > caller, you see many calls to push_back, which is small -- because
> > the hot path is small and the cold path has not (yet) been inlined
> > -- and inlines them all, at which point it can make a sensible
> > decision about the cold-path calls).
> 
> I don't see that. You get the same information when looking at a pair
> of
> functions and deciding whether to inline. With the bottom-up walk, we
> analyze the caller and callee in their entirety before deciding
> whether
> to inline. I assume a top-down inliner would do the same.

You don't get the same information because the functions are different. To be clear, I completely agree that bottom-up inlining is more likely to expose further simplification opportunities, and thus assist with analysis. It is not clear to me, however, that this property always makes it better. The canonical example of this seems to be some kind of vector push_back:

  void push_back(const T &e) {
    if (size < reserved)
      data[size++] = e;
    else
      push_back_with_realloc(e);
  }

if we assume that we only ever keep one version of the function, then we have a choice: do you inline the call to push_back_with_realloc? If we think of push_back as a function that will be called by some external caller (which seems to be the viewpoint that bottom-up inlining requires), then you might conclude it worthwhile to inline the call to push_back_with_realloc into push_back. However, if you take the viewpoint that push_back itself is being inlining into its callers, then you might easily reach a different decision. If push_back is being inlined, then you might leave the calls to push_back_with_realloc as-is -- if the caller has many calls to push_back, inlining the version that contains only the fast path could be much better than inlining a version of push_back which contains an inlined slow path. Unfortunately, how you wish to inline push_back_with_realloc really seems to depend on whether or not you inlining push_back into its callers, and this seems to argue for a more top-down approach.

Now, of course, if push_back_with_realloc is actually also small, you'd want to inline it regardless. And inlining bottom-up seems like it will yield better information on how big the function will be. It might be that the top-down approach works well only because it matches, in some sense, a common code idiom (a small function with the fast path and a call to the slow path -- where the fast path is clear because it does not itself rely on many getters/setters/etc). If that's true, we need to understand how adding more context awareness to the bottom-up inliner will address that. The proposed patch uses some hybrid method, and perhaps that kind of approach can simultaneously do everything well ;)

It is also not obvious to me that keeping only one version of the function, used both for inlining and for standalone generation, is ideal. Perhaps changing that is really what is needed to address things properly.

Thanks again,
Hal

> 
> If you have a top-down traversal and you don't have the whole
> program,
> the first problem you have is a whole ton of starting points. At
> first
> blush bottom up seems to have the same problem, except that they are
> generally very straight-forward leaf functions -- setters and getters
> or
> little loops to test for a property. Top down you don't yet know what
> you've got, and it has lots of calls that may access arbitrary
> memory.
> In either case, you apply your metric to inline or not. Then you run
> the
> function-level passes to perform simplification. Bottom up, you're
> much
> more likely to get meaningful simplifications -- your getter/setter
> melts away. Top down you may remove some redundant loads or dead
> stores,
> but you still don't know what's going on because you have these
> opaque
> not-yet-analyzed callees in the way. If you couldn't analyze the
> memory
> before, inlining one level away hasn't helped you, and the function
> size
> has grown. You don't get the simplifications until you go all the way
> down the call stack to the setters and getters etc.
> 
> There's a fix for this, and that's to perform a sort of symbolic
> execution and just keep track of what the program has done so far
> (ie.
> what values registers have taken on so far, which pointers have
> escaped
> etc.), and make each inlining decision in program execution order.
> But
> that fix doesn't get you very far if you haven't got a significant
> chunk
> of program to work with.
> 
> Nick
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory