[LLVMdev] Path forward on profile guided inlining?

Xinliang David Li davidxl at google.com
Thu Jun 18 23:10:34 PDT 2015


> The memory size part doesn't surprise me for at least some analysis passes.
> What's the runtime cost of the incremental update?

I don't have such data for LLVM, but it is done in other compilers and
it is not expensive to do.

>
> We start with:
> A calls B with call site count 200
> B calls C with call site count 200 (but never when called from A)
> X calls C with call site count 50
> A's entry count is 200
> B's entry count is 210
> C's entry count is 250
>
> After inlining A->B, we get:
> A calls C with call site count 0 (because we proved away the call site)
> B calls C with call site count 200*10/210 (a small fraction of the original)
> X calls C with call site count 50
> A's entry count is 200
> B's entry count is 10
> C's entry count is 250
>
> The information for B->C is wildly inaccurate, and we've misstated the
> ratio between the call sites B->C and X->C.  Any decision we make off this
> data is likely to be questionable.  This isn't a correctness issue, but it
> definitely makes writing optimization heuristics harder.

That is what I tried to say.  The hotness of the call site B->C depends
on the calling context -- when B is called from A, C is rarely called,
but when B is called from X, C is called many times -- even though B is
mostly called from A.  As in:

void B(int n);
void C(void);

void A(void)
{
  for (int i = 0; i < 200; i++)
    B(0);    /* B is called often from here, but never reaches C */
}

void X(void)
{
  for (int i = 0; i < 10; i++)
    B(20);   /* each call from here makes B call C 20 times */
}

void B(int n)
{
  for (int i = 0; i < n; i++)
    C();
}

int main() { A(); X(); }

Simple scaling cannot handle this.  The solution in this case is that if
the compiler knows the exact count of the new A->C call site after
inlining B into A, then instead of using simple scaling to compute the
remaining count of B->C as 200*10/210, it can do a more precise update
by subtraction:
count(B->C) = original_count(B->C) - count(A->C)
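
As a rough sketch of that update (the helper name is made up for
illustration; this is not an existing LLVM interface):

#include <stdint.h>

/* After inlining A->B, the new A->C call site carries NewACCount; the
   out-of-line B->C site keeps whatever is left of its original count. */
static uint64_t remainingBCCount(uint64_t OrigBCCount, uint64_t NewACCount)
{
  /* Guard against underflow if the profile counts are slightly inconsistent. */
  return OrigBCCount > NewACCount ? OrigBCCount - NewACCount : 0;
}

/* In the program above, OrigBCCount = 200 and NewACCount = 0 (the inlined
   loop never executes when n == 0), so B->C keeps its full count of 200,
   whereas proportional scaling would have left it with only 200*10/210,
   i.e. about 10. */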

>
> (Note that the ability to rescale by block frequency within B solves this
> problem quite nicely.  Thus, we want the pass manager changes!)

How can that help?  The block frequencies inside B are context
insensitive, so if they are not properly updated after inlining, we will
have the same problem.
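
To make that concern concrete, here is a rough sketch of the arithmetic,
using a generic "entry count times relative block frequency" estimate
(an assumption for illustration, not a particular LLVM interface):

/* Hypothetical estimate of a call site's count from its enclosing
   function's entry count and block frequencies. */
static double estimatedCallCount(double EntryCount, double SiteBlockFreq,
                                 double EntryBlockFreq)
{
  return EntryCount * SiteBlockFreq / EntryBlockFreq;
}

/* B's block frequencies are aggregated over all of B's callers, so the
   block containing the call to C looks like it runs 200 times per 210
   entries no matter who the caller is.  Rescaling with those stale
   frequencies after inlining B into A gives:
     inlined copy in A:  estimatedCallCount(200, 200, 210) ~ 190  (reality: 0)
     remaining B:        estimatedCallCount(10,  200, 210) ~ 10   (reality: 200)
   which is the same wrong split as scaling the raw call counts. */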

thanks,

David
>
>
>>
>>> What we can assume (and thus make inlining decisions on) is that a) a
>>> given call site's count is a reasonable approximation of its actual
>>> execution count, and b) the sum of the call site counts for a given
>>> callee is less than or equal to the callee's entry count.  What we
>>> can't do is compare the counts of two call sites for the same callee
>>> in two different functions and decide with high confidence which is
>>> more frequent.
>>>
>>>
>>> Alternate Approaches:
>>> 1) We could just recompute BFI explicitly in the inliner right before
>>> passing the result to ICA, purely for prototyping purposes.  If this
>>> were off by default, it might be a reasonable scheme for investigation,
>>> but it could likely never be enabled for real uses.
>>> 2) We could pre-compute BFI once per function within a given SCC and
>>> then try to keep it up to date during inlining.  If we cached the call
>>> frequencies for the initial call sites, we could adjust the visit order
>>> to minimize the number of times we need to recompute a given function's
>>> block frequencies (e.g. we can look at all the original call sites
>>> within a function before looking at newly inlined ones).
>>>
>>> For the record, it's not clear that the bottom-up inlining approach
>>> works anywhere near as well once you start using profiling information
>>> about the caller/callee pair.  You would seem to end up with cases
>>> where inlining B->C makes it more likely you'll inline D->C, but if
>>> you visited them in the other order you wouldn't inline D->C.
>>> Consider the case where B->C is a hot call site and D->C is the only
>>> other call site of C.  I have no plans to actually tackle this
>>> problem.
>>
>> The inliner can be taught to delay inlining and use a top-down scheme.
>> For instance, if B has lots of incoming calls but there is only one
>> dominating call edge from A, it is likely better to delay inlining
>> decisions for call sites in B that may have a large overhead until B
>> itself gets inlined.
>
> It can be done; agreed.  Just out of scope for the moment.
>>
>>
>> thanks,
>>
>> David
>
>


