[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

Thu Dec 10 17:50:11 PST 2015

> On Dec 10, 2015, at 5:12 PM, Philip Reames via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> 
> 
> On 12/10/2015 05:03 PM, Xinliang David Li wrote:
>> On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
>> <listmail at philipreames.com> wrote:
>>> 
>>> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
>>>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
>>>> <listmail at philipreames.com> wrote:
>>>>> Given I didn't get any response to my original query, I chose not to
>>>>> invest
>>>>> time in this at the time.  I am unlikely to get time for this in the near
>>>>> future.
>>>>> 
>>>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
>>>>> 
>>>>> (Resending after  removing llvmdev at cs.uiuc.edu and using
>>>>> llvm-dev at lists.llvm.org)
>>>>> 
>>>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman <eraman at google.com> wrote:
>>>>>> Hi Philip,
>>>>>> 
>>>>>>   Is there any update on this? I've been sending patches to get rid of
>>>>>> the
>>>>>> callee hotness based inline hints from the frontend and move the logic
>>>>>> to
>>>>>> the inliner. The next step is to use the callsite hotness instead. I
>>>>>> also
>>>>>> want to focus on the infrastructure to enable this and what I've been
>>>>>> experimenting with is similar to your two alternative approaches:
>>>>>> 
>>>>>>> Alternate Approaches:
>>>>>>> 1) We could just recompute BFI explicitly in the inliner right before
>>>>>>> passing the result to ICA for the purposes of prototyping. If this was
>>>>>>> off
>>>>>>> by default, this might be a reasonable scheme for investigation.  This
>>>>>>> could
>>>>>>> likely never be enabled for real uses.
>>>>>>> 2) We could pre-compute BFI once per function within a given SCC and
>>>>>>> then
>>>>>>> try to keep it up to date during inlining.  If we cached the call
>>>>>>> frequencies for the initial call sites, we could adjust the visit order
>>>>>>> to
>>>>>>> minimize the number of times we need to recompute a given functions
>>>>>>> block
>>>>>>> frequencies.  (e.g. we can look at all the original call sites within a
>>>>>>> function before looking at newly inlined ones)
>>>>>>> 
>>>>>> My proposal is very similar (perhaps identical) to your option 2 above.
>>>>>> I
>>>>>> don't understand the part where you talk about adjusting the visit order
>>>>>> to
>>>>>> minimize BFI computation.
>>>>>> 
>>>>>> BFI computation:  BFI for a function is computed on demand and cached.
>>>>> The problem here is that the old pass manager does not have a place/way
>>>>> to
>>>>> cache the function level analysis.  You could do this within the inliner
>>>>> itself, but the function passes run by the SCC pass manager will
>>>>> invalidate
>>>>> everything.  That's the problem I was trying to solve.
>>>> Easwaran's proposal does not rely on the analysis pass framework of
>>>> the legacy pass manager -- that is the key idea. BFI info is designed
>>>> in a way such that it can be computed on-demand.
>>>> 
>>>> BFI for the caller will be incrementally updated during inlining, but
>>>> after inlining is done for a node, the BFI info will be explicitly
>>>> invalidated for that node due to the function passes run later by SCC
>>>> pass manager.  This is expected.
>>> Just to make sure, you're aware this will cost at least one re-computation
>>> per iteration in CGPassManager::runOnModule right? If that's expected and
>>> acceptable, then yes, the approach you seem to be describing should be
>>> workable.
>>> 
>> For each leaf node, the number of BFI computation is <= 1. For other
>> nodes, it is <=2 (one as caller, one as callee) -- in other words, it
>> is linear to the number of functions.
> This is not true.
> 
> In CGPassManager::runOnModule, we will iterate on the *same* SCC multiple times if we're making progress (i.e. devirtualizing call sites) on each iteration.  The BFI will be invalidated on each iteration.  This is caped at 4 iterations, so it's still semi linear, but only because there's a cap.
> 
> Examples where this might kick in:
> void bar() {};
> func_ptr foo() { return &bar; }
> void caller() {
>  func_ptr f = foo();
>  f();
> }
> 
> We will visit caller twice so that we can inline both foo and bar.

I think you would need a far more complex example to trigger the revisit: GVN or other need to actually remove a call, the inliner doing its job won’t trigger a reiteration at this level.

— 
Mehdi

> 
> I think this iteration is okay for what your proposing - i.e. off by default, migration step, etc - but it's important to understand the implications of the planned changes.
> 
> 
>> 
>> 
>>> Particularly as an intermediate step to parallelize work with the new pass
>>> manager, this sounds very much worthwhile.
>> Yes, that is exactly the intention -- it can buy pass manager rework
>> more time while unlock lots of pending performance tuning work we plan
>> to do.
>> 
>> thanks,
>> 
>> David
>> 
>> 
>>>>>> Update: When 'bar' gets inlined into 'foo', the BFI for 'foo' is
>>>>>> updated.
>>>>>> Let OldBB in 'bar' gets cloned as NewBB in 'foo'.  NewBB's block
>>>>>> frequency
>>>>>> can be incrementally computed from OldBB's block frequency, entry block
>>>>>> frequency of 'bar' and the frequency of the block containing the 'foo'
>>>>>> ->
>>>>>> 'bar' callsite. Even when the new CGSCC level BFI analysis is in place,
>>>>>> this
>>>>>> incremental update is useful to minimize computation.
>>>>>> 
>>>>>> Invalidation:  Once inlining is completed in an SCC (at the end of
>>>>>> runOnSCC), the entries for functions in that SCC are invalidated since
>>>>>> other
>>>>>> passes run by the CGSCC pass manager (including those run by the
>>>>>> function
>>>>>> pass manager run under CGSCC pass manager) might affect the computed BFI
>>>>>> for
>>>>>> the functions in the SCC.
>>>>> This sounds like you're essentially planning to extend the old pass
>>>>> manager
>>>>> with function level analysis support in the SCC pass manager.  I advise
>>>>> against this given the fact that Chandler is actively working on the new
>>>>> one
>>>>> and getting patches reviewed for the old one is "challenging" at best.
>>>>> If
>>>>> you take this approach, make sure Chandler - who is pretty much the only
>>>>> active reviewer in this area - is on-board with your approach before you
>>>>> start.
>>>> There is no plan to extend old pass manager for this capability.  The
>>>> proposed solution is intended to work with the new pass manager too,
>>>> so there is no conflict here. The BFI book-keeping by inliner can be
>>>> removed with pass manager change is in place.
>>> Thanks for the clarification.
>>> 
>>>> David
>>>> 
>>>> 
>>>>>> When the new PM infrastructure and a CGSCC based BFI analysis is in
>>>>>> place,
>>>>>> the transition should be easy assuming it will provide getBFI(Function
>>>>>> *)
>>>>>> and invalidateBFI(Function *) interfaces. BFI for a function is computed
>>>>>> at
>>>>>> most twice in this approach. Thoughts?
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Easwaran
>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>> 
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev