[llvm-dev] RFC: Synthetic function entry counts

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Fri Dec 15 17:50:06 PST 2017


On Fri, Dec 15, 2017 at 1:28 PM, Xinliang David Li <davidxl at google.com>
wrote:

>
>
> On Fri, Dec 15, 2017 at 11:56 AM, Sean Silva <chisophugis at gmail.com>
> wrote:
>
>>
>>
>> On Fri, Dec 15, 2017 at 11:27 AM, Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Dec 15, 2017 at 11:13 AM, Sean Silva <chisophugis at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Dec 15, 2017 at 10:22 AM, Easwaran Raman <eraman at google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 15, 2017 at 12:22 AM, Sean Silva <chisophugis at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> IIUC, this proposal is just saying that we should infer a static
>>>>>> profile for entry counts just like we do for branch probabilities. In the
>>>>>> case of entry counts, we do not hide that information behind an analysis
>>>>>> like BPI, so currently just annotating synthetic PGO entry counts is a
>>>>>> simple solution that piggybacks on the PGO mechanism and Just Works.
>>>>>>
>>>>>> If that is correct then this makes perfect sense to me.
>>>>>>
>>>>> Yes, that is the proposal.
>>>>>
>>>>>>
>>>>>> It could be argued that we ought to refactor things so that the raw
>>>>>> PGO metadata is only ever accessed via a wrapper CGSCC analysis that
>>>>>> falls back to interprocedural analysis (i.e. static profile heuristics)
>>>>>> when the entry count metadata is missing, just like BPI does with static
>>>>>> intraprocedural analysis and branch weight metadata. However, we
>>>>>> probably don't want to do that while folks are still depending on the
>>>>>> old PM in production, since CGSCC analyses don't exist there, which
>>>>>> would force us to maintain both an old and a new way of doing it.
>>>>>>
>>>>>> (Also, it sounds like you want to compute this with a top-down CGSCC
>>>>>> traversal, so it might not actually be computable incrementally as a bottom
>>>>>> up CGSCC analysis which is what CGSCC analyses currently do; an auxiliary
>>>>>> module analysis for the top-down part might work around this though)
>>>>>>
>>>>>
>>>>>>
>>>>> Wrapping function entry counts behind an analysis might be doable but
>>>>> requires more careful thought.
>>>>>
>>>>
>>>> Definitely.
>>>>
>>>>
>>>>> Right now getEntryCount is called from various passes: inliner, loop
>>>>> passes (unroll, loop sink), codegenprepare (for function layout), machine
>>>>> block placement, module summary analysis ... This might be worth attempting
>>>>> when we fully migrate to the new pass manager but not now.
>>>>>
>>>>
>>>> It seems like many of those places that directly reference
>>>> getEntryCount are using it to check "HasProfileData". So adding synthetic
>>>> values would cause us to take a lot of code paths in the compiler that were
>>>> intended to use real profile data.
>>>>
>>>> As a simple example, MachineBlockPlacement::collectLoopBlockSet uses
>>>> MBFI and MBPI guarded by a check of getEntryCount to see if PGO data is
>>>> present (the comment even says "This needs precise profile data and we only
>>>> do this when profile data is available."). Do we need to refactor things
>>>> first to use a more semantically meaningful check, like
>>>> "hasRealProfileData" so that synthetic counts don't fool this code into
>>>> thinking it has real profile data? At least, comments like the one I quoted
>>>> need to be updated to indicate that a static profile is considered precise
>>>> enough.
>>>>
>>>>
>>>>
>>> I think the existing code that checks 'hasEntryCount()' for
>>> 'isProfileAvailable' information will need to be refactored to query
>>> 'isRealProfileAvailable' instead.
>>>
>>>
>>>>
>>>> Also, one other question: in the case of FullLTO, how do we ensure that
>>>> the synthetic profile data from static heuristics is consistent across
>>>> TU's? Do we need to recalculate the synthetic profile data? If we need to
>>>> recalculate it, presumably we need to know which profile data is synthetic
>>>> or not. E.g. in a scenario where most TU's are compiled with profile data,
>>>> but some are not (e.g. vendor or runtime lib delivered as bitcode),
>>>> recalculated synthetic counts at FullLTO time should honor real profile
>>>> counts but ignore synthetic counts from per-TU compilation.
>>>>
>>>> The PGO+FullLTO case with a runtime lib delivered as bitcode is not
>>>> hypothetical; this was actually a common configuration used in production
>>>> while I was at PlayStation. CC'ing Davide, River to confirm that they still
>>>> care about this case.
>>>>
>>>>
>>>> Also, more generally, can you clarify how your static entry count
>>>> calculation algorithm is expected to work in the presence of partial
>>>> profile data?
>>>>
>>>
>>> I think the BFI information from those modules with real profile data
>>> can be used for the IPA frequency propagation purpose, but the 'real' entry
>>> count information will need to be dropped.  The entry counts for those
>>> library functions are collected and aggregated from contexts of different
>>> hosting binaries and their usefulness is very questionable.
>>>
>>
>> The case at PlayStation was one where the vendor bitcode libraries simply
>> did not have PGO data at all.
>>
>>
>>>  They need to be 'normalized' in the context of the target binary,
>>> which is what the synthetic entry count computation does.
>>>
>>
>> Real entry counts might still be useful. For example, in a GUI or web
>> server app there may be a large number of functions that are called only
>> indirectly via an event loop that might be in a system library (e.g. inside
>> a libevent or Qt .so file). The collected branch probabilities are not
>> enough to decide that one handler is much hotter than the other.
>> Is the plan to not support this case for now? (i.e., we would require
>> the event loop library to be built with PGO so that indirect call
>> profiling can infer the hotness data)
>>
>>
> Unless we can indirect-call promote those handlers and inline them, the
> global entry info for these handlers won't matter much, though.
>

What if two handlers call the same helper, but one handler is very hot
compared with the other? We should prefer inlining into the hot handler.

-- Sean Silva


>
> David
>
>
>
>
>> -- Sean Silva
>>
>>
>>> David
>>>
>>>
>>>
>>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>>
>>>>> Also, the need to run this logic (or similar logic) as a "ThinLTO
>>>>>> analysis" suggests not wedding it too much with the intricacies of the
>>>>>> IR-level pass management (although admittedly we already do that with the
>>>>>> inliner and then ThinLTO has to approximate those inlining decisions, so it
>>>>>> might not be the end of the world to have some divergence).
>>>>>>
>>>>>> -- Sean Silva
>>>>>>
>>>>>> On Dec 12, 2017 5:02 PM, "Easwaran Raman via llvm-dev" <
>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>> Functions in LLVM IR have function_entry_count metadata that is
>>>>>>> attached during PGO compilation. By using the entry count together with
>>>>>>> the block frequency info, the compiler computes the profile count of
>>>>>>> call instructions, based on which the hotness/coldness of callsites can
>>>>>>> be determined. Experiments have shown that using a higher threshold for
>>>>>>> hot callsites results in improved runtime performance of the generated
>>>>>>> code without significant code size increases. We propose to generate
>>>>>>> synthetic function counts for non-PGO compilations and to use these
>>>>>>> counts to boost hot callsites during inlining.
>>>>>>>
>>>>>>> Synthetic function entry counts are initialized based on properties of
>>>>>>> the function, such as whether it is visible outside the module, whether
>>>>>>> it is declared with the inline keyword, and so on. Then, the SCCs of
>>>>>>> the callgraph are traversed in reverse post-order. The count of each
>>>>>>> callsite is determined from the entry count of the caller and the block
>>>>>>> frequency of the callsite, and is added to the entry count of the
>>>>>>> callee. For targets of indirect calls, we will use the !callees
>>>>>>> metadata to find the possible targets and distribute the count equally
>>>>>>> among them. For functions in a non-trivial SCC, the algorithm has to
>>>>>>> ensure that the counts are stable and deterministic.
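[Editorial note: the top-down propagation described above can be sketched roughly as follows. This is a hypothetical, simplified illustration over an acyclic callgraph; the struct and function names are invented and this is not the actual LLVM implementation (non-trivial SCCs and !callees distribution are omitted).]

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A direct call edge, carrying the callsite's block frequency relative to
// the caller's entry (e.g. 2.0 means the callsite runs twice per entry).
struct CallEdge {
  std::string Callee;
  double RelativeFreq;
};

struct Func {
  uint64_t SyntheticCount = 0; // seeded from static heuristics
  std::vector<CallEdge> Calls;
};

// Visit functions in reverse post-order of the callgraph (callers before
// callees). Each callsite's count is the caller's entry count scaled by
// the callsite's relative block frequency, and is accumulated into the
// callee's entry count.
void propagate(std::map<std::string, Func> &CG,
               const std::vector<std::string> &RPO) {
  for (const std::string &Name : RPO) {
    Func &F = CG[Name];
    for (const CallEdge &E : F.Calls) {
      uint64_t CallsiteCount =
          static_cast<uint64_t>(F.SyntheticCount * E.RelativeFreq);
      CG[E.Callee].SyntheticCount += CallsiteCount;
    }
  }
}
```

For example, if main (count 100) calls handler with frequency 0.5 and helper with frequency 2.0, and handler in turn calls helper with frequency 3.0, handler ends up with count 50 and helper with 200 + 150 = 350.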
>>>>>>>
>>>>>>> In ThinLTO mode, the function summary contains the list of call
>>>>>>> edges from the function. We propose to add the relative block frequency on
>>>>>>> these edges. During the thinlink phase, we propagate the function counts on
>>>>>>> the entire call graph and update the function summary with the synthetic
>>>>>>> counts. Additionally, we plan to use the computed counts to drive the
>>>>>>> importing decisions.
>>>>>>>
>>>>>>> Alternative approach
>>>>>>>
>>>>>>> -----------------------------
>>>>>>>
>>>>>>>
>>>>>>> An alternative to generating synthetic counts is to make block
>>>>>>> frequency info an inter-procedural analysis. Such an analysis would allow
>>>>>>> comparing BFI of callsites in two different functions. This has several
>>>>>>> downsides:
>>>>>>>
>>>>>>>    - The inter-procedural BFI computation is likely to be more
>>>>>>>      expensive in terms of compile time.
>>>>>>>
>>>>>>>    - Many function passes invalidate BFI, which would require
>>>>>>>      selective invalidation of function BFIs.
>>>>>>>
>>>>>>>    - The inliner already correctly updates the function count of a
>>>>>>>      callee after a callsite is inlined; by using synthetic counts we
>>>>>>>      can piggyback on this existing mechanism.
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> llvm-dev at lists.llvm.org
>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

