[LLVMdev] Path forward on profile guided inlining?

Thu Jul 30 22:35:17 PDT 2015

Hi Chandler, I'd like to hear your thought about this interim solution
to make global profile information available to inliner.

thanks,

David

On Wed, Jun 17, 2015 at 3:49 PM, Philip Reames
<listmail at philipreames.com> wrote:
> I would like to start prototyping changes towards profile guided inlining.
> Before doing so, I wanted to get feedback from the community on the
> appropriate approach.  I'm going to layout a strawman proposal, but I'm open
> to other ideas on how to approach the problem.
>
> Depending on what approach we settle on, I *may* be able to commit resources
> to actually implement this in the near term.  I can't commit to too much
> work here, so if we settle on something which will need a lot of enabling
> infrastructure work first, I'm likely going to need to just hack this within
> my local tree.  I'm hoping to avoid that, but I want to be up front about
> the practicalities here.
>
> Now, on to the strawman proposal...
>
> This proposal is intended to not be dependent on the ongoing pass manger
> changes by Chandler.  It's intended to be compatible with that work, but not
> reliant on it.
>
> Background
> One of the problems we have today is that we can't ask for information about
> the frequency of the call site we're examining within the inlining iteration
> loop.  We can get at this information from normal transform passes run
> within the outer inlining loop (CallGraphSCCPass), but the inner loop does
> not have the ability to keep function analysis up to date while inlining.
>
> We have an existing piece of metadata on functions which record the function
> entry count.  This is not currently kept up to date in any way; it records
> the entry count of the function w/o.r.t. to inlining.  This metadata was
> only recently added and doesn't effect optimization today.
>
> Objective
> I'm looking to be able to implement two key inlining heuristics.
>
> The first is to recognize when we have the last remaining *actual* call to a
> callee.  If we can prove there's a single caller today, we essentially just
> assume it's profitable to inline.  I want to extend this logic to the case
> where we have a single call site which dominates the caller profile for a
> given callee.  We might need to still keep around a copy of the callee
> (external, or rare callers), but the remaining callee definition will be a
> cold function, can be optimized for size, and can be placed far from hot
> code.
>
> Second, I would like to use the frequency of the call site to adjust the
> inline threshold within the inliner.  This basically means that we become
> more willing to inline hot functions (and probably less willing to inline
> cold ones).
>
> At least to start with, both of these optimizations would be off by default
> under a flag.  I figure there's a lot to be discussed here, but I'd prefer
> to defer that a bit.  We need to get the enabling parts in place before
> worrying too much about the heuristics.
>
>
> The Ideal Answer
> If we had the pass manager changes done, we'd be able to just ask BFI to
> compute an estimated execution count for the call site.  Both of the inliner
> heuristics I'm interested in could be implemented using that information
> combined with an up to date estimate of the function entry count.
>
> If we had the pass manager changes, the only thing we'd need to do is update
> InlineFunction to subtract the estimated call count from the entry count of
> the remaining callee definition.  This would result in the entry count of B
> reflecting the estimated entry count from the remaining callers of B.
> (Given the call site count is an estimate, this implies that entry counts
> are also now approximate. Given typical profiling mechanisms are lossy,
> that's not a big deal, but is worth noting explicitly.)
>
> (See also my note at the bottom about iteration order.  I consider that
> somewhat out of scope for this proposal, but it effects both the ideal and
> practical sections herein.)
>
>
> The Practical Answer
> Essentially, the remainder of this is an approximation of the pass manager
> based solution which let's us start experimenting with the inliner changes
> while waiting for the pass manager.  At least in theory, when we have the
> pass manager in place, we can simply drop most of this.
>
> The basic idea is to use metadata to effectively cache the estimated
> frequency for a given call site.  We can use BFI to generate/update this
> information once per iteration of the outer inlining loop.  As a result, we
> only need to keep it reasonably up to date within the inner inlining loop.
> (See note below about alternate approaches.)
>
> The metadata on the call site would look something like:
> call void @foo() !prof !"estimated_execution_count" 2000 (where 2000 is the
> estimated count)
>
> We'd have a new FunctionPass which removes any existing metadata and adds
> new metadata which basically comes down to a product of the functions entry
> count and the estimated block frequency (provided by BFI) of the block
> containing the call site.  This would run as the last FunctionPass within
> the outer inlining loop.  Assuming that the entry counts are kept reasonably
> up to date, the resulting estimated call site counts should be reasonable.
>
> Within the inliner itself, we'd need to update InlineFunction to keep the
> estimated counts in sync:
> - When inlining B into A, split the estimated call counts on the calls
> within B between the remaining instance of B and the newly created call
> sites within A.  I plan to split the estimated count in roughly the ratio of
> the entries into B.  As a result, a given call site BC (originally from B
> into C) would be split into two sites AC, and BC with estimated counts
> (AB.count/B.entry_count * OrigBC.count) and ((1-AB.count/B.entry_count) *
> OrigBC.count).  This does require updating the body of the callee, not just
> the caller when inlining.
>
> It's important to note that the resulting estimated call counts can be just
> flat out wrong.  As the easiest example, consider a case where B called C,
> but only when a boolean parameter was set to true.  If we split the count of
> BC into AC, BC and then drop the call site AC, we've essentially lost
> information about the frequency of the remaining BC w.r.t. any other callers
> of C.  We could try to adjust for this by only updating calls which don't
> get pruned away by constant propagation within InlineFunction, but I'm not
> sure how worthwhile it is trying to be smart here given the information will
> be roughly restored once we're out of the inner loop again.
>
> What we can assume (and thus make inlining decisions on), is that a) a given
> call sites count is a reasonable approximation of it's actual execution
> count and b) that the sum of the call site counts for a given callee is less
> than or equal to the callee's entry count.  What we can't do is compare two
> call sites counts for the same callee within two different functions and
> decide with high confidence which is more frequent.
>
>
> Alternate Approaches:
> 1) We could just recompute BFI explicitly in the inliner right before
> passing the result to ICA for the purposes of prototyping. If this was off
> by default, this might be a reasonable scheme for investigation.  This could
> likely never be enabled for real uses.
> 2) We could pre-compute BFI once per function within a given SCC and then
> try to keep it up to date during inlining.  If we cached the call
> frequencies for the initial call sites, we could adjust the visit order to
> minimize the number of times we need to recompute a given functions block
> frequencies.  (e.g. we can look at all the original call sites within a
> function before looking at newly inlined ones)
>
> For the record, it's not clear that the bottom up inlining approach works
> anywhere near as well once you starting using profiling information about
> the caller/callee pair.  You would seem to end up with cases where inlining
> BC makes it more likely you'll inline DC, but if you visited them in the
> other order you wouldn't inline DC. Consider the case where BC is a hot call
> site, and DC is the only other caller of C.  I have no plans to actually
> tackle this problem.