[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

Thu Dec 10 17:28:39 PST 2015

On Thu, Dec 10, 2015 at 5:12 PM, Philip Reames
<listmail at philipreames.com> wrote:
>
>
> On 12/10/2015 05:03 PM, Xinliang David Li wrote:
>>
>> On Thu, Dec 10, 2015 at 4:51 PM, Philip Reames
>> <listmail at philipreames.com> wrote:
>>>
>>>
>>> On 12/10/2015 04:29 PM, Xinliang David Li wrote:
>>>>
>>>> On Thu, Dec 10, 2015 at 4:00 PM, Philip Reames
>>>> <listmail at philipreames.com> wrote:
>>>>>
>>>>> Given I didn't get any response to my original query, I chose not to
>>>>> invest
>>>>> time in this at the time.  I am unlikely to get time for this in the
>>>>> near
>>>>> future.
>>>>>
>>>>> On 12/07/2015 03:13 PM, Easwaran Raman wrote:
>>>>>
>>>>> (Resending after  removing llvmdev at cs.uiuc.edu and using
>>>>> llvm-dev at lists.llvm.org)
>>>>>
>>>>> On Mon, Dec 7, 2015 at 3:08 PM, Easwaran Raman <eraman at google.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Philip,
>>>>>>
>>>>>>    Is there any update on this? I've been sending patches to get rid
>>>>>> of
>>>>>> the
>>>>>> callee hotness based inline hints from the frontend and move the logic
>>>>>> to
>>>>>> the inliner. The next step is to use the callsite hotness instead. I
>>>>>> also
>>>>>> want to focus on the infrastructure to enable this and what I've been
>>>>>> experimenting with is similar to your two alternative approaches:
>>>>>>
>>>>>>> Alternate Approaches:
>>>>>>> 1) We could just recompute BFI explicitly in the inliner right before
>>>>>>> passing the result to ICA for the purposes of prototyping. If this
>>>>>>> was
>>>>>>> off
>>>>>>> by default, this might be a reasonable scheme for investigation.
>>>>>>> This
>>>>>>> could
>>>>>>> likely never be enabled for real uses.
>>>>>>> 2) We could pre-compute BFI once per function within a given SCC and
>>>>>>> then
>>>>>>> try to keep it up to date during inlining.  If we cached the call
>>>>>>> frequencies for the initial call sites, we could adjust the visit
>>>>>>> order
>>>>>>> to
>>>>>>> minimize the number of times we need to recompute a given functions
>>>>>>> block
>>>>>>> frequencies.  (e.g. we can look at all the original call sites within
>>>>>>> a
>>>>>>> function before looking at newly inlined ones)
>>>>>>>
>>>>>> My proposal is very similar (perhaps identical) to your option 2
>>>>>> above.
>>>>>> I
>>>>>> don't understand the part where you talk about adjusting the visit
>>>>>> order
>>>>>> to
>>>>>> minimize BFI computation.
>>>>>>
>>>>>> BFI computation:  BFI for a function is computed on demand and cached.
>>>>>
>>>>> The problem here is that the old pass manager does not have a place/way
>>>>> to
>>>>> cache the function level analysis.  You could do this within the
>>>>> inliner
>>>>> itself, but the function passes run by the SCC pass manager will
>>>>> invalidate
>>>>> everything.  That's the problem I was trying to solve.
>>>>
>>>> Easwaran's proposal does not rely on the analysis pass framework of
>>>> the legacy pass manager -- that is the key idea. BFI info is designed
>>>> in a way such that it can be computed on-demand.
>>>>
>>>> BFI for the caller will be incrementally updated during inlining, but
>>>> after inlining is done for a node, the BFI info will be explicitly
>>>> invalidated for that node due to the function passes run later by SCC
>>>> pass manager.  This is expected.
>>>
>>> Just to make sure, you're aware this will cost at least one
>>> re-computation
>>> per iteration in CGPassManager::runOnModule right? If that's expected and
>>> acceptable, then yes, the approach you seem to be describing should be
>>> workable.
>>>
>> For each leaf node, the number of BFI computation is <= 1. For other
>> nodes, it is <=2 (one as caller, one as callee) -- in other words, it
>> is linear to the number of functions.
>
> This is not true.
>
> In CGPassManager::runOnModule, we will iterate on the *same* SCC multiple
> times if we're making progress (i.e. devirtualizing call sites) on each
> iteration.  The BFI will be invalidated on each iteration.  This is caped at
> 4 iterations, so it's still semi linear, but only because there's a cap.
>
> Examples where this might kick in:
> void bar() {};
> func_ptr foo() { return &bar; }
> void caller() {
>   func_ptr f = foo();
>   f();
> }
>
> We will visit caller twice so that we can inline both foo and bar.
>

Right. I would guess on average only a small portion of all SCCs can
trigger such iterative compilation.

David

> I think this iteration is okay for what your proposing - i.e. off by
> default, migration step, etc - but it's important to understand the
> implications of the planned changes.
>
>
>
>>
>>
>>> Particularly as an intermediate step to parallelize work with the new
>>> pass
>>> manager, this sounds very much worthwhile.
>>
>> Yes, that is exactly the intention -- it can buy pass manager rework
>> more time while unlock lots of pending performance tuning work we plan
>> to do.
>>
>> thanks,
>>
>> David
>>
>>
>>>>>> Update: When 'bar' gets inlined into 'foo', the BFI for 'foo' is
>>>>>> updated.
>>>>>> Let OldBB in 'bar' gets cloned as NewBB in 'foo'.  NewBB's block
>>>>>> frequency
>>>>>> can be incrementally computed from OldBB's block frequency, entry
>>>>>> block
>>>>>> frequency of 'bar' and the frequency of the block containing the 'foo'
>>>>>> ->
>>>>>> 'bar' callsite. Even when the new CGSCC level BFI analysis is in
>>>>>> place,
>>>>>> this
>>>>>> incremental update is useful to minimize computation.
>>>>>>
>>>>>> Invalidation:  Once inlining is completed in an SCC (at the end of
>>>>>> runOnSCC), the entries for functions in that SCC are invalidated since
>>>>>> other
>>>>>> passes run by the CGSCC pass manager (including those run by the
>>>>>> function
>>>>>> pass manager run under CGSCC pass manager) might affect the computed
>>>>>> BFI
>>>>>> for
>>>>>> the functions in the SCC.
>>>>>
>>>>> This sounds like you're essentially planning to extend the old pass
>>>>> manager
>>>>> with function level analysis support in the SCC pass manager.  I advise
>>>>> against this given the fact that Chandler is actively working on the
>>>>> new
>>>>> one
>>>>> and getting patches reviewed for the old one is "challenging" at best.
>>>>> If
>>>>> you take this approach, make sure Chandler - who is pretty much the
>>>>> only
>>>>> active reviewer in this area - is on-board with your approach before
>>>>> you
>>>>> start.
>>>>
>>>> There is no plan to extend old pass manager for this capability.  The
>>>> proposed solution is intended to work with the new pass manager too,
>>>> so there is no conflict here. The BFI book-keeping by inliner can be
>>>> removed with pass manager change is in place.
>>>
>>> Thanks for the clarification.
>>>
>>>> David
>>>>
>>>>
>>>>>> When the new PM infrastructure and a CGSCC based BFI analysis is in
>>>>>> place,
>>>>>> the transition should be easy assuming it will provide getBFI(Function
>>>>>> *)
>>>>>> and invalidateBFI(Function *) interfaces. BFI for a function is
>>>>>> computed
>>>>>> at
>>>>>> most twice in this approach. Thoughts?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Easwaran
>>>>>>
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>>
>>>>>>
>