[llvm-dev] RFC: Synthetic function entry counts

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Fri Dec 15 13:28:10 PST 2017


On Fri, Dec 15, 2017 at 11:56 AM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Fri, Dec 15, 2017 at 11:27 AM, Xinliang David Li <davidxl at google.com>
> wrote:
>
>>
>>
>> On Fri, Dec 15, 2017 at 11:13 AM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Dec 15, 2017 at 10:22 AM, Easwaran Raman <eraman at google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Dec 15, 2017 at 12:22 AM, Sean Silva <chisophugis at gmail.com>
>>>> wrote:
>>>>
>>>>> IIUC, this proposal is just saying that we should infer a static
>>>>> profile for entry counts just like we do for branch probabilities. In the
>>>>> case of entry counts, we do not hide that information behind an analysis
>>>>> like BPI, so currently just annotating synthetic PGO entry counts is a
>>>>> simple solution that piggybacks on the PGO mechanism and Just Works.
>>>>>
>>>>> If that is correct then this makes perfect sense to me.
>>>>>
>>>> Yes, that is the proposal.
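>>>>
>>>> As a minimal sketch of the piggybacking idea (the helper below is
>>>> hypothetical and the API shape is approximated, not the proposed patch):
>>>>
>>>>   #include "llvm/IR/Function.h"
>>>>   #include <cstdint>
>>>>
>>>>   // Sketch: attach a computed synthetic count through the same
>>>>   // entry-count annotation that PGO uses, so existing consumers of
>>>>   // getEntryCount() pick it up without modification.
>>>>   void annotateSyntheticCount(llvm::Function &F, uint64_t SyntheticCount) {
>>>>     if (!F.getEntryCount())            // don't clobber a real PGO count
>>>>       F.setEntryCount(SyntheticCount);
>>>>   }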
>>>>
>>>>>
>>>>> It could be argued that we ought to refactor things so that the raw
>>>>> PGO metadata is only ever accessed via a wrapper CGSCC analysis that falls
>>>>> back to interprocedural analysis (i.e. static profile heuristics) when the
>>>>> entry count metadata is missing, just like BPI does with static
>>>>> intraprocedural analysis and branch weight metadata. However, we probably
>>>>> don't want to do that while folks are still depending on the old PM in
>>>>> production, since CGSCC analyses don't exist there; that would force us to
>>>>> maintain both an old and a new way of doing it.
>>>>>
>>>>> (Also, it sounds like you want to compute this with a top-down CGSCC
>>>>> traversal, so it might not actually be computable incrementally as a
>>>>> bottom-up CGSCC analysis, which is what CGSCC analyses currently do; an
>>>>> auxiliary module analysis for the top-down part might work around this, though.)
>>>>>
>>>>
>>>>>
>>>> Wrapping function entry counts behind an analysis might be doable but
>>>> requires more careful thought.
>>>>
>>>
>>> Definitely.
>>>
>>>
>>>> Right now getEntryCount is called from various passes: inliner, loop
>>>> passes (unroll, loop sink), codegenprepare (for function layout), machine
>>>> block placement, module summary analysis ... This might be worth attempting
>>>> when we fully migrate to the new pass manager but not now.
>>>>
>>>
>>> It seems like many of those places that directly reference getEntryCount
>>> are using it to check "HasProfileData". So adding synthetic values would
>>> cause us to take a lot of code paths in the compiler that were intended to
>>> use real profile data.
>>>
>>> As a simple example, MachineBlockPlacement::collectLoopBlockSet uses
>>> MBFI and MBPI guarded by a check of getEntryCount to see if PGO data is
>>> present (the comment even says "This needs precise profile data and we only
>>> do this when profile data is available."). Do we need to refactor things
>>> first to use a more semantically meaningful check, like
>>> "hasRealProfileData" so that synthetic counts don't fool this code into
>>> thinking it has real profile data? At least, comments like the one I quoted
>>> need to be updated to indicate that a static profile is considered precise
>>> enough.
>>>
>>>
>>>
>> I think the existing code that checks 'hasEntryCount()' as a proxy for
>> 'is profile available' will need to be refactored to query something like
>> 'isRealProfileAvailable'.
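>>
>> A rough sketch of what such a predicate might look like (the name and the
>> synthetic-count marker are assumptions for illustration, not existing API):
>>
>>   #include "llvm/IR/Function.h"
>>
>>   // Sketch: passes that need precise data (e.g. the MachineBlockPlacement
>>   // check Sean quoted) would ask this instead of checking getEntryCount().
>>   bool isRealProfileAvailable(const llvm::Function &F) {
>>     auto Count = F.getEntryCount();   // entry count annotation, if any
>>     if (!Count)
>>       return false;                   // no profile information at all
>>     return !Count->isSynthetic();     // assumed marker on synthetic counts
>>   }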
>>
>>
>>>
>>> Also, one other question: in the case of FullLTO, how do we ensure that
>>> the synthetic profile data from static heuristics is consistent across
>>> TUs? Do we need to recalculate the synthetic profile data? If we need to
>>> recalculate it, presumably we need to know which profile data is synthetic
>>> or not. E.g., in a scenario where most TUs are compiled with profile data,
>>> but some are not (e.g. vendor or runtime lib delivered as bitcode),
>>> recalculated synthetic counts at FullLTO time should honor real profile
>>> counts but ignore synthetic counts from per-TU compilation.
>>>
>>> The PGO+FullLTO case with a runtime lib delivered as bitcode is not
>>> hypothetical; this was actually a common configuration used in production
>>> while I was at PlayStation. CC'ing Davide, River to confirm that they still
>>> care about this case.
>>>
>>>
>>> Also, more generally, can you clarify how your static entry count
>>> calculation algorithm is expected to work in the presence of partial
>>> profile data?
>>>
>>
>> I think the BFI information from those modules with real profile data can
>> be used for the IPA frequency propagation, but the 'real' entry count
>> information will need to be dropped. The entry counts for those library
>> functions are collected and aggregated across the contexts of different
>> hosting binaries, so their usefulness is very questionable.
>>
>
> The case at PlayStation was one where the vendor bitcode libraries simply
> did not have PGO data at all.
>
>
>>  They need to be 'normalized' in the context of the target binary, which
>> is what the synthetic entry count computation is doing.
>>
>
> Real entry counts might still be useful. For example, in a GUI or web
> server app there may be a large number of functions that are called only
> indirectly via an event loop that might be in a system library (e.g. inside
> a libevent or Qt .so file). The collected branch probabilities are not
> enough to decide that one handler is much hotter than the other.
> Is the plan not to support this case for now? (i.e., we would require
> you to build the event loop library with PGO so that indirect call
> profiling can infer the hotness data)
>
>
Unless we can indirect-call promote those handlers and inline them, the
global entry count info for these handlers won't matter much, though.

David




> -- Sean Silva
>
>
>> David
>>
>>
>>
>>>
>>> -- Sean Silva
>>>
>>>
>>>>
>>>>> Also, the need to run this logic (or similar logic) as a "ThinLTO
>>>>> analysis" suggests not wedding it too tightly to the intricacies of the
>>>>> IR-level pass management (although admittedly we already do that with the
>>>>> inliner and then ThinLTO has to approximate those inlining decisions, so it
>>>>> might not be the end of the world to have some divergence).
>>>>>
>>>>> -- Sean Silva
>>>>>
>>>>> On Dec 12, 2017 5:02 PM, "Easwaran Raman via llvm-dev" <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> Functions in LLVM IR carry function_entry_count metadata that is
>>>>>> attached during PGO compilation. By using the entry count together with
>>>>>> the block frequency info, the compiler computes the profile count of
>>>>>> call instructions, from which the hotness/coldness of callsites can be
>>>>>> callsites results in improved runtime performance of the generated code
>>>>>> without significant code size increases. We propose to generate synthetic
>>>>>> function counts for non-PGO compilation and use the counts for boosting hot
>>>>>> callsites during inlining.
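>>>>>>
>>>>>> For illustration, a rough sketch of the callsite-count computation this
>>>>>> relies on (the helper is hypothetical; overflow handling and the exact
>>>>>> BFI API are glossed over):
>>>>>>
>>>>>>   #include "llvm/Analysis/BlockFrequencyInfo.h"
>>>>>>   #include "llvm/IR/Function.h"
>>>>>>
>>>>>>   // Sketch: scale the caller's entry count by the callsite block's
>>>>>>   // frequency relative to the entry block's frequency.
>>>>>>   uint64_t getCallsiteCount(const llvm::Function &Caller,
>>>>>>                             const llvm::BasicBlock &CallBB,
>>>>>>                             const llvm::BlockFrequencyInfo &BFI,
>>>>>>                             uint64_t CallerEntryCount) {
>>>>>>     uint64_t BBFreq = BFI.getBlockFreq(&CallBB).getFrequency();
>>>>>>     uint64_t EntryFreq =
>>>>>>         BFI.getBlockFreq(&Caller.getEntryBlock()).getFrequency();
>>>>>>     return EntryFreq ? CallerEntryCount * BBFreq / EntryFreq : 0;
>>>>>>   }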
>>>>>>
>>>>>> Synthetic function entry counts are initialized based on properties of
>>>>>> the function, such as whether it is visible outside the module, whether
>>>>>> it is declared with the inline keyword, and so on. Then the SCCs of the
>>>>>> call graph are traversed in reverse post-order. The count of each
>>>>>> callsite is determined from the caller's entry count and the block
>>>>>> frequency of the callsite, and is added to the entry count of the
>>>>>> callee. For targets of indirect calls, we will use the !callees metadata
>>>>>> to find the possible targets and distribute the count equally among
>>>>>> them. For functions in a non-trivial SCC, the algorithm has to ensure
>>>>>> that the counts are stable and deterministic.
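>>>>>>
>>>>>> A simplified sketch of this propagation over a generic graph (the types
>>>>>> and the flat reverse-post-order worklist are placeholders, not the
>>>>>> proposed LLVM implementation):
>>>>>>
>>>>>>   #include <cstdint>
>>>>>>   #include <vector>
>>>>>>
>>>>>>   // Placeholder call-graph node: an initial synthetic count plus the
>>>>>>   // outgoing call edges (direct or expanded from !callees metadata),
>>>>>>   // each carrying the callsite's relative block frequency.
>>>>>>   struct Node {
>>>>>>     uint64_t SyntheticCount;  // seeded from function properties
>>>>>>     struct Edge { Node *Callee; double RelBlockFreq; };
>>>>>>     std::vector<Edge> Callees;
>>>>>>   };
>>>>>>
>>>>>>   // Visit callers in reverse post-order; each callsite contributes
>>>>>>   // Count * RelBlockFreq to its callee's entry count.
>>>>>>   void propagateCounts(const std::vector<Node *> &ReversePostOrder) {
>>>>>>     for (Node *Caller : ReversePostOrder)
>>>>>>       for (const auto &E : Caller->Callees)
>>>>>>         E.Callee->SyntheticCount += static_cast<uint64_t>(
>>>>>>             Caller->SyntheticCount * E.RelBlockFreq);
>>>>>>   }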
>>>>>>
>>>>>> In ThinLTO mode, the function summary contains the list of call edges
>>>>>> from the function. We propose to add the relative block frequency on these
>>>>>> edges. During the thin link phase, we propagate the function counts over the
>>>>>> entire call graph and update the function summary with the synthetic
>>>>>> counts. Additionally, we plan to use the computed counts to drive the
>>>>>> importing decisions.
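>>>>>>
>>>>>> As a sketch, the per-edge summary information could look roughly like
>>>>>> this (field names are illustrative, not the actual ModuleSummaryIndex
>>>>>> layout):
>>>>>>
>>>>>>   #include <cstdint>
>>>>>>   #include <vector>
>>>>>>
>>>>>>   // Illustrative ThinLTO-style function summary: call edges annotated
>>>>>>   // with relative block frequency, so the thin link can propagate
>>>>>>   // synthetic counts over the whole-program call graph.
>>>>>>   struct FunctionSummarySketch {
>>>>>>     uint64_t Guid;                    // global id of the function
>>>>>>     uint64_t SyntheticEntryCount = 0; // filled in during the thin link
>>>>>>     struct CallEdge {
>>>>>>       uint64_t CalleeGuid;
>>>>>>       double RelBlockFreq;            // callsite freq / entry freq
>>>>>>     };
>>>>>>     std::vector<CallEdge> Calls;
>>>>>>   };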
>>>>>>
>>>>>> Alternative approach
>>>>>>
>>>>>> -----------------------------
>>>>>>
>>>>>>
>>>>>> An alternative to generating synthetic counts is to make block
>>>>>> frequency info an inter-procedural analysis. Such an analysis would allow
>>>>>> comparing BFI of callsites in two different functions. This has several
>>>>>> downsides:
>>>>>>
>>>>>>    - The inter-procedural BFI computation is likely to be more
>>>>>>      expensive in terms of compile time.
>>>>>>    - Many function passes invalidate BFI. This will require selective
>>>>>>      invalidation of the affected functions' BFI.
>>>>>>    - The inliner already correctly updates the entry count of a callee
>>>>>>      after a callsite is inlined; synthetic counts can simply piggyback
>>>>>>      on this existing mechanism, whereas an inter-procedural BFI would
>>>>>>      need its own update logic.
>>>>>>
>>>>
>>>
>>
>