[llvm-dev] RFC: Synthetic function entry counts

Easwaran Raman via llvm-dev llvm-dev at lists.llvm.org
Tue Dec 19 16:56:41 PST 2017


On Fri, Dec 15, 2017 at 8:10 PM, Xinliang David Li <davidxl at google.com>
wrote:

>
>
> On Fri, Dec 15, 2017 at 5:50 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>>
>>
>> On Fri, Dec 15, 2017 at 1:28 PM, Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Dec 15, 2017 at 11:56 AM, Sean Silva <chisophugis at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Dec 15, 2017 at 11:27 AM, Xinliang David Li <davidxl at google.com
>>>> > wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 15, 2017 at 11:13 AM, Sean Silva <chisophugis at gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 15, 2017 at 10:22 AM, Easwaran Raman <eraman at google.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 15, 2017 at 12:22 AM, Sean Silva <chisophugis at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> IIUC, this proposal is just saying that we should infer a static
>>>>>>>> profile for entry counts just like we do for branch probabilities. In the
>>>>>>>> case of entry counts, we do not hide that information behind an analysis
>>>>>>>> like BPI, so currently just annotating synthetic PGO entry counts is a
>>>>>>>> simple solution that piggybacks on the PGO mechanism and Just Works.
>>>>>>>>
>>>>>>>> If that is correct then this makes perfect sense to me.
>>>>>>>>
>>>>>>> Yes, that is the proposal.
>>>>>>>
>>>>>>>>
>>>>>>>> It could be argued that we ought to refactor things so that the raw
>>>>>>>> PGO metadata is only ever accessed via a wrapper CGSCC analysis that falls
>>>>>>>> back to interprocedural analysis (i.e. static profile heuristics) when the
>>>>>>>> entry count metadata is missing, just like BPI does with static
>>>>>>>> intraprocedural analysis and branch weight metadata. However, we probably
>>>>>>>> don't want to do that while folks are still depending on the old PM in
>>>>>>>> production, since CGSCC analyses don't exist there, which would force us
>>>>>>>> to maintain both an old and a new way of doing it.
>>>>>>>>
>>>>>>>> (Also, it sounds like you want to compute this with a top-down
>>>>>>>> CGSCC traversal, so it might not actually be computable incrementally as a
>>>>>>>> bottom up CGSCC analysis which is what CGSCC analyses currently do; an
>>>>>>>> auxiliary module analysis for the top-down part might work around this
>>>>>>>> though)
>>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>> Wrapping function entry counts behind an analysis might be doable
>>>>>>> but requires more careful thought.
>>>>>>>
>>>>>>
>>>>>> Definitely.
>>>>>>
>>>>>>
>>>>>>> Right now getEntryCount is called from various passes: inliner, loop
>>>>>>> passes (unroll, loop sink), codegenprepare (for function layout), machine
>>>>>>> block placement, module summary analysis ... This might be worth attempting
>>>>>>> when we fully migrate to the new pass manager but not now.
>>>>>>>
>>>>>>
>>>>>> It seems like many of those places that directly reference
>>>>>> getEntryCount are using it to check "HasProfileData". So adding synthetic
>>>>>> values would cause us to take a lot of code paths in the compiler that were
>>>>>> intended to use real profile data.
>>>>>>
>>>>>> As a simple example, MachineBlockPlacement::collectLoopBlockSet uses
>>>>>> MBFI and MBPI guarded by a check of getEntryCount to see if PGO data is
>>>>>> present (the comment even says "This needs precise profile data and we only
>>>>>> do this when profile data is available."). Do we need to refactor things
>>>>>> first to use a more semantically meaningful check, like
>>>>>> "hasRealProfileData" so that synthetic counts don't fool this code into
>>>>>> thinking it has real profile data? At least, comments like the one I quoted
>>>>>> need to be updated to indicate that a static profile is considered precise
>>>>>> enough.
>>>>>>
>>>>>>
>>>>>>
>>>>> I think the existing code that checks 'hasEntryCount()' as a proxy for
>>>>> 'isProfileAvailable' will need to be refactored to query
>>>>> 'isRealProfileAvailable' instead.
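A minimal sketch of the kind of distinction being discussed, using
hypothetical names rather than the actual LLVM API:

  #include <cstdint>
  #include <optional>

  // Hypothetical model of a profile count that remembers its provenance.
  enum class ProfileCountKind { Real, Synthetic };

  struct ProfileCount {
    uint64_t Count;
    ProfileCountKind Kind;
  };

  struct Function {
    // Empty when no entry count metadata is attached at all.
    std::optional<ProfileCount> EntryCount;

    // Old-style check: true for both real and synthetic counts.
    bool hasEntryCount() const { return EntryCount.has_value(); }

    // New-style check: passes that need measured profile data
    // (e.g. precise block placement) would query this instead.
    bool hasRealProfileData() const {
      return EntryCount && EntryCount->Kind == ProfileCountKind::Real;
    }
  };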
>>>>>
>>>>>
>>>>>>
>>>>>> Also, one other question: in the case of FullLTO, how do we ensure
>>>>>> that the synthetic profile data from static heuristics is consistent across
>>>>>> TU's? Do we need to recalculate the synthetic profile data? If we need to
>>>>>> recalculate it, presumably we need to know which profile data is synthetic
>>>>>> or not. E.g. in a scenario where most TU's are compiled with profile data,
>>>>>> but some are not (e.g. vendor or runtime lib delivered as bitcode),
>>>>>> recalculated synthetic counts at FullLTO time should honor real profile
>>>>>> counts but ignore synthetic counts from per-TU compilation.
>>>>>>
>>>>>> The PGO+FullLTO case with a runtime lib delivered as bitcode is not
>>>>>> hypothetical; this was actually a common configuration used in production
>>>>>> while I was at PlayStation. CC'ing Davide, River to confirm that they still
>>>>>> care about this case.
>>>>>>
>>>>>>
>>>>>> Also, more generally, can you clarify how your static entry count
>>>>>> calculation algorithm is expected to work in the presence of partial
>>>>>> profile data?
>>>>>>
>>>>>
>>>>> I think the BFI information from those modules with real profile data
>>>>> can be used for IPA frequency propagation, but the 'real' entry count
>>>>> information will need to be dropped.  The entry counts for those library
>>>>> functions are collected and aggregated across the contexts of different
>>>>> hosting binaries, so their usefulness is very questionable.
>>>>>
>>>>
>>>> The case at PlayStation was one where the vendor bitcode libraries
>>>> simply did not have PGO data at all.
>>>>
>>>>
>>>>>  They need to be 'normalized' in the context of the target binary,
>>>>> which is what the synthetic entry count computation does.
>>>>>
>>>>
>>>> Real entry counts might still be useful. For example, in a GUI or web
>>>> server app there may be a large number of functions that are called only
>>>> indirectly via an event loop that might be in a system library (e.g. inside
>>>> a libevent or Qt .so file). The collected branch probabilities are not
>>>> enough to decide that one handler is much hotter than the other.
>>>> Is the plan to currently not support this case? (i.e., we would require
>>>> you to build the event loop library with PGO so that indirect call
>>>> profiling can infer the hotness data)
>>>>
>>>>
>>> Unless we can indirect-call promote those handlers and inline them, the
>>> global entry count info for these handlers won't matter much, though.
>>>
>>
>> What if two handlers call the same helper, but one handler is very hot
>> compared with the other? We should prefer inlining into the hot handler.
>>
>
> Yes, the entry count can differentiate in this case. On the other hand,
> with a partial profile, we really do not have good/complete global profile
> summary information. This makes it harder to guess the global hotness even
> with entry counts.
>
Yes, this is the main challenge for the partial-profile case, and I do not
plan to handle the mixed-profile case initially.



>
> By the way, the partial profile data case is probably also interesting to
> us -- but in the context of sample PGO.  We will need to find a good way to
> deal with it once we have the basic infrastructure ready for synthetic
> profile data.
>
> thanks,
>
> David
>
>
>>
>> -- Sean Silva
>>
>>
>>>
>>> David
>>>
>>>
>>>
>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> -- Sean Silva
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Also, the need to run this logic (or similar logic) as a "ThinLTO
>>>>>>>> analysis" suggests not wedding it too tightly to the intricacies of
>>>>>>>> IR-level pass management (although admittedly we already do that with the
>>>>>>>> inliner, and then ThinLTO has to approximate those inlining decisions, so it
>>>>>>>> might not be the end of the world to have some divergence).
>>>>>>>>
>>>>>>>> -- Sean Silva
>>>>>>>>
>>>>>>>> On Dec 12, 2017 5:02 PM, "Easwaran Raman via llvm-dev" <
>>>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>>>
>>>>>>>>> Functions in LLVM IR have function_entry_count metadata that is
>>>>>>>>> attached during PGO compilation. By using the entry count together with
>>>>>>>>> the block frequency info, the compiler computes the profile count of call
>>>>>>>>> instructions, based on which the hotness/coldness of callsites can be
>>>>>>>>> determined. Experiments have shown that using a higher threshold for hot
>>>>>>>>> callsites results in improved runtime performance of the generated code
>>>>>>>>> without significant code size increases. We propose to generate synthetic
>>>>>>>>> function entry counts for non-PGO compilation and use these counts to
>>>>>>>>> boost hot callsites during inlining.
>>>>>>>>>
>>>>>>>>> Synthetic entry counts of functions are initialized based on
>>>>>>>>> properties of the function, such as whether it is visible outside the
>>>>>>>>> module, whether it is declared with the inline keyword, and so on. Then
>>>>>>>>> the SCCs of the callgraph are traversed in reverse post-order. The count
>>>>>>>>> of each callsite is determined from the entry count of the caller and the
>>>>>>>>> block frequency of the callsite, and this callsite count is added to the
>>>>>>>>> entry count of the callee. For indirect calls, we will use the !callees
>>>>>>>>> metadata to find the possible targets and distribute the count equally
>>>>>>>>> among them. For functions in a non-trivial SCC, the algorithm has to
>>>>>>>>> ensure that the counts are stable and deterministic.
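To make the propagation concrete, here is a small self-contained sketch of
the core update (hypothetical data structures, not the actual
implementation; it assumes functions are already ordered so that callers
are visited before callees and omits the extra iteration needed to make
counts stable for non-trivial SCCs):

  #include <cstdint>
  #include <vector>

  // Hypothetical flattened call-graph representation.
  struct CallSite {
    std::vector<int> Callees; // one target for a direct call; the possible
                              // targets from !callees metadata for an
                              // indirect call
    double RelFreq;           // block frequency of the callsite relative to
                              // the function entry
  };

  struct FunctionNode {
    uint64_t SyntheticCount;         // initial guess from function properties
    std::vector<CallSite> CallSites; // outgoing call edges
  };

  // Propagate synthetic entry counts top-down over the call graph. RPO lists
  // function indices in reverse post-order of the SCC DAG, so callers are
  // visited before their callees.
  void propagateSyntheticCounts(std::vector<FunctionNode> &CG,
                                const std::vector<int> &RPO) {
    for (int F : RPO) {
      for (const CallSite &CS : CG[F].CallSites) {
        // Callsite count = caller entry count scaled by relative frequency.
        uint64_t CallCount =
            static_cast<uint64_t>(CG[F].SyntheticCount * CS.RelFreq);
        if (CS.Callees.empty())
          continue;
        // For indirect calls, distribute the count equally among the
        // possible targets; direct calls have a single target.
        uint64_t Share = CallCount / CS.Callees.size();
        for (int Callee : CS.Callees)
          CG[Callee].SyntheticCount += Share;
      }
    }
  }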
>>>>>>>>>
>>>>>>>>> In ThinLTO mode, the function summary contains the list of call
>>>>>>>>> edges from the function. We propose to add the relative block frequency
>>>>>>>>> to these edges. During the thin-link phase, we propagate the function
>>>>>>>>> counts over the entire call graph and update the function summaries with
>>>>>>>>> the synthetic counts. Additionally, we plan to use the computed counts to
>>>>>>>>> drive importing decisions.
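A rough sketch of what a summary call edge might carry under this proposal,
with hypothetical field names (the real ThinLTO summary layout may differ):

  #include <cstdint>
  #include <vector>

  // Hypothetical per-function summary with the relative block frequency
  // recorded on each outgoing call edge.
  struct CallEdgeSummary {
    uint64_t CalleeGUID;  // identifies the callee across modules
    double RelBlockFreq;  // callsite frequency relative to function entry
  };

  struct FunctionSummary {
    uint64_t GUID;
    uint64_t SyntheticEntryCount = 0;       // filled in during the thin link
    std::vector<CallEdgeSummary> CallEdges; // outgoing calls with frequencies
  };

At thin-link time, the same propagation sketched above can then run over
these summaries, since together they describe the whole-program call graph
along with the relative frequencies.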
>>>>>>>>>
>>>>>>>>> Alternative approach
>>>>>>>>>
>>>>>>>>> -----------------------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> An alternative to generating synthetic counts is to make block
>>>>>>>>> frequency info an inter-procedural analysis. Such an analysis would allow
>>>>>>>>> comparing BFI of callsites in two different functions. This has several
>>>>>>>>> downsides:
>>>>>>>>>
>>>>>>>>>    - The inter-procedural BFI computation is likely to be more
>>>>>>>>>      expensive in terms of compile time.
>>>>>>>>>
>>>>>>>>>    - Many function passes invalidate the BFI. This will require
>>>>>>>>>      selective invalidation of function BFIs.
>>>>>>>>>
>>>>>>>>>    - The inliner correctly updates the function counts of a callee
>>>>>>>>>      after a callsite is inlined. We can piggyback on this mechanism by
>>>>>>>>>      using synthetic counts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> LLVM Developers mailing list
>>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>