[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

Thu Dec 10 17:12:50 PST 2015

On 12/10/2015 05:03 PM, Xinliang David Li wrote:
> On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
> <listmail at philipreames.com> wrote:
>>
>> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
>>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
>>> <listmail at philipreames.com> wrote:
>>>> Given I didn't get any response to my original query, I chose not to
>>>> invest
>>>> time in this at the time.  I am unlikely to get time for this in the near
>>>> future.
>>>>
>>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
>>>>
>>>> (Resending after  removing llvmdev at cs.uiuc.edu and using
>>>> llvm-dev at lists.llvm.org)
>>>>
>>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman <eraman at google.com> wrote:
>>>>> Hi Philip,
>>>>>
>>>>>    Is there any update on this? I've been sending patches to get rid of
>>>>> the
>>>>> callee hotness based inline hints from the frontend and move the logic
>>>>> to
>>>>> the inliner. The next step is to use the callsite hotness instead. I
>>>>> also
>>>>> want to focus on the infrastructure to enable this and what I've been
>>>>> experimenting with is similar to your two alternative approaches:
>>>>>
>>>>>> Alternate Approaches:
>>>>>> 1) We could just recompute BFI explicitly in the inliner right before
>>>>>> passing the result to ICA for the purposes of prototyping. If this was
>>>>>> off
>>>>>> by default, this might be a reasonable scheme for investigation.  This
>>>>>> could
>>>>>> likely never be enabled for real uses.
>>>>>> 2) We could pre-compute BFI once per function within a given SCC and
>>>>>> then
>>>>>> try to keep it up to date during inlining.  If we cached the call
>>>>>> frequencies for the initial call sites, we could adjust the visit order
>>>>>> to
>>>>>> minimize the number of times we need to recompute a given functions
>>>>>> block
>>>>>> frequencies.  (e.g. we can look at all the original call sites within a
>>>>>> function before looking at newly inlined ones)
>>>>>>
>>>>> My proposal is very similar (perhaps identical) to your option 2 above.
>>>>> I
>>>>> don't understand the part where you talk about adjusting the visit order
>>>>> to
>>>>> minimize BFI computation.
>>>>>
>>>>> BFI computation:  BFI for a function is computed on demand and cached.
>>>> The problem here is that the old pass manager does not have a place/way
>>>> to
>>>> cache the function level analysis.  You could do this within the inliner
>>>> itself, but the function passes run by the SCC pass manager will
>>>> invalidate
>>>> everything.  That's the problem I was trying to solve.
>>> Easwaran's proposal does not rely on the analysis pass framework of
>>> the legacy pass manager -- that is the key idea. BFI info is designed
>>> in a way such that it can be computed on-demand.
>>>
>>> BFI for the caller will be incrementally updated during inlining, but
>>> after inlining is done for a node, the BFI info will be explicitly
>>> invalidated for that node due to the function passes run later by SCC
>>> pass manager.  This is expected.
>> Just to make sure, you're aware this will cost at least one re-computation
>> per iteration in CGPassManager::runOnModule right? If that's expected and
>> acceptable, then yes, the approach you seem to be describing should be
>> workable.
>>
> For each leaf node, the number of BFI computation is <= 1. For other
> nodes, it is <=2 (one as caller, one as callee) -- in other words, it
> is linear to the number of functions.
This is not true.

In CGPassManager::runOnModule, we will iterate on the *same* SCC 
multiple times if we're making progress (i.e. devirtualizing call sites) 
on each iteration.  The BFI will be invalidated on each iteration.  This 
is caped at 4 iterations, so it's still semi linear, but only because 
there's a cap.

Examples where this might kick in:
void bar() {};
func_ptr foo() { return &bar; }
void caller() {
   func_ptr f = foo();
   f();
}

We will visit caller twice so that we can inline both foo and bar.

I think this iteration is okay for what your proposing - i.e. off by 
default, migration step, etc - but it's important to understand the 
implications of the planned changes.

>
>
>> Particularly as an intermediate step to parallelize work with the new pass
>> manager, this sounds very much worthwhile.
> Yes, that is exactly the intention -- it can buy pass manager rework
> more time while unlock lots of pending performance tuning work we plan
> to do.
>
> thanks,
>
> David
>
>
>>>>> Update: When 'bar' gets inlined into 'foo', the BFI for 'foo' is
>>>>> updated.
>>>>> Let OldBB in 'bar' gets cloned as NewBB in 'foo'.  NewBB's block
>>>>> frequency
>>>>> can be incrementally computed from OldBB's block frequency, entry block
>>>>> frequency of 'bar' and the frequency of the block containing the 'foo'
>>>>> ->
>>>>> 'bar' callsite. Even when the new CGSCC level BFI analysis is in place,
>>>>> this
>>>>> incremental update is useful to minimize computation.
>>>>>
>>>>> Invalidation:  Once inlining is completed in an SCC (at the end of
>>>>> runOnSCC), the entries for functions in that SCC are invalidated since
>>>>> other
>>>>> passes run by the CGSCC pass manager (including those run by the
>>>>> function
>>>>> pass manager run under CGSCC pass manager) might affect the computed BFI
>>>>> for
>>>>> the functions in the SCC.
>>>> This sounds like you're essentially planning to extend the old pass
>>>> manager
>>>> with function level analysis support in the SCC pass manager.  I advise
>>>> against this given the fact that Chandler is actively working on the new
>>>> one
>>>> and getting patches reviewed for the old one is "challenging" at best.
>>>> If
>>>> you take this approach, make sure Chandler - who is pretty much the only
>>>> active reviewer in this area - is on-board with your approach before you
>>>> start.
>>> There is no plan to extend old pass manager for this capability.  The
>>> proposed solution is intended to work with the new pass manager too,
>>> so there is no conflict here. The BFI book-keeping by inliner can be
>>> removed with pass manager change is in place.
>> Thanks for the clarification.
>>
>>> David
>>>
>>>
>>>>> When the new PM infrastructure and a CGSCC based BFI analysis is in
>>>>> place,
>>>>> the transition should be easy assuming it will provide getBFI(Function
>>>>> *)
>>>>> and invalidateBFI(Function *) interfaces. BFI for a function is computed
>>>>> at
>>>>> most twice in this approach. Thoughts?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Easwaran
>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>