[LLVMdev] Path forward on profile guided inlining?
Philip Reames
listmail at philipreames.com
Wed Jun 17 15:49:13 PDT 2015
I would like to start prototyping changes towards profile guided
inlining. Before doing so, I wanted to get feedback from the community
on the appropriate approach. I'm going to layout a strawman proposal,
but I'm open to other ideas on how to approach the problem.
Depending on what approach we settle on, I *may* be able to commit
resources to actually implement this in the near term. I can't commit
to too much work here, so if we settle on something which will need a
lot of enabling infrastructure work first, I'm likely going to need to
just hack this within my local tree. I'm hoping to avoid that, but I
want to be up front about the practicalities here.
Now, on to the strawman proposal...
This proposal is intended to not be dependent on the ongoing pass manger
changes by Chandler. It's intended to be compatible with that work, but
not reliant on it.
Background
One of the problems we have today is that we can't ask for information
about the frequency of the call site we're examining within the inlining
iteration loop. We can get at this information from normal transform
passes run within the outer inlining loop (CallGraphSCCPass), but the
inner loop does not have the ability to keep function analysis up to
date while inlining.
We have an existing piece of metadata on functions which record the
function entry count. This is not currently kept up to date in any way;
it records the entry count of the function w/o.r.t. to inlining. This
metadata was only recently added and doesn't effect optimization today.
Objective
I'm looking to be able to implement two key inlining heuristics.
The first is to recognize when we have the last remaining *actual* call
to a callee. If we can prove there's a single caller today, we
essentially just assume it's profitable to inline. I want to extend
this logic to the case where we have a single call site which dominates
the caller profile for a given callee. We might need to still keep
around a copy of the callee (external, or rare callers), but the
remaining callee definition will be a cold function, can be optimized
for size, and can be placed far from hot code.
Second, I would like to use the frequency of the call site to adjust the
inline threshold within the inliner. This basically means that we
become more willing to inline hot functions (and probably less willing
to inline cold ones).
At least to start with, both of these optimizations would be off by
default under a flag. I figure there's a lot to be discussed here, but
I'd prefer to defer that a bit. We need to get the enabling parts in
place before worrying too much about the heuristics.
The Ideal Answer
If we had the pass manager changes done, we'd be able to just ask BFI to
compute an estimated execution count for the call site. Both of the
inliner heuristics I'm interested in could be implemented using that
information combined with an up to date estimate of the function entry
count.
If we had the pass manager changes, the only thing we'd need to do is
update InlineFunction to subtract the estimated call count from the
entry count of the remaining callee definition. This would result in
the entry count of B reflecting the estimated entry count from the
remaining callers of B. (Given the call site count is an estimate, this
implies that entry counts are also now approximate. Given typical
profiling mechanisms are lossy, that's not a big deal, but is worth
noting explicitly.)
(See also my note at the bottom about iteration order. I consider that
somewhat out of scope for this proposal, but it effects both the ideal
and practical sections herein.)
The Practical Answer
Essentially, the remainder of this is an approximation of the pass
manager based solution which let's us start experimenting with the
inliner changes while waiting for the pass manager. At least in theory,
when we have the pass manager in place, we can simply drop most of this.
The basic idea is to use metadata to effectively cache the estimated
frequency for a given call site. We can use BFI to generate/update this
information once per iteration of the outer inlining loop. As a result,
we only need to keep it reasonably up to date within the inner inlining
loop. (See note below about alternate approaches.)
The metadata on the call site would look something like:
call void @foo() !prof !"estimated_execution_count" 2000 (where 2000 is
the estimated count)
We'd have a new FunctionPass which removes any existing metadata and
adds new metadata which basically comes down to a product of the
functions entry count and the estimated block frequency (provided by
BFI) of the block containing the call site. This would run as the last
FunctionPass within the outer inlining loop. Assuming that the entry
counts are kept reasonably up to date, the resulting estimated call site
counts should be reasonable.
Within the inliner itself, we'd need to update InlineFunction to keep
the estimated counts in sync:
- When inlining B into A, split the estimated call counts on the calls
within B between the remaining instance of B and the newly created call
sites within A. I plan to split the estimated count in roughly the
ratio of the entries into B. As a result, a given call site BC
(originally from B into C) would be split into two sites AC, and BC with
estimated counts (AB.count/B.entry_count * OrigBC.count) and
((1-AB.count/B.entry_count) * OrigBC.count). This does require updating
the body of the callee, not just the caller when inlining.
It's important to note that the resulting estimated call counts can be
just flat out wrong. As the easiest example, consider a case where B
called C, but only when a boolean parameter was set to true. If we
split the count of BC into AC, BC and then drop the call site AC, we've
essentially lost information about the frequency of the remaining BC
w.r.t. any other callers of C. We could try to adjust for this by only
updating calls which don't get pruned away by constant propagation
within InlineFunction, but I'm not sure how worthwhile it is trying to
be smart here given the information will be roughly restored once we're
out of the inner loop again.
What we can assume (and thus make inlining decisions on), is that a) a
given call sites count is a reasonable approximation of it's actual
execution count and b) that the sum of the call site counts for a given
callee is less than or equal to the callee's entry count. What we can't
do is compare two call sites counts for the same callee within two
different functions and decide with high confidence which is more frequent.
Alternate Approaches:
1) We could just recompute BFI explicitly in the inliner right before
passing the result to ICA for the purposes of prototyping. If this was
off by default, this might be a reasonable scheme for investigation.
This could likely never be enabled for real uses.
2) We could pre-compute BFI once per function within a given SCC and
then try to keep it up to date during inlining. If we cached the call
frequencies for the initial call sites, we could adjust the visit order
to minimize the number of times we need to recompute a given functions
block frequencies. (e.g. we can look at all the original call sites
within a function before looking at newly inlined ones)
For the record, it's not clear that the bottom up inlining approach
works anywhere near as well once you starting using profiling
information about the caller/callee pair. You would seem to end up with
cases where inlining BC makes it more likely you'll inline DC, but if
you visited them in the other order you wouldn't inline DC. Consider the
case where BC is a hot call site, and DC is the only other caller of C.
I have no plans to actually tackle this problem.
More information about the llvm-dev
mailing list