[llvm-dev] RFC: Hotness thresholds in profile header

Thu Dec 17 11:17:48 PST 2015

On Thu, Dec 17, 2015 at 9:21 AM, Andy Ayers <andya at microsoft.com> wrote:
> While your bb count distribution is extremely likely to be some kind of power-law like distribution, it's not guaranteed.
>
> Also you might think about operations that can amplify (rerolling) or appear to amplify (TRE) or diminish BB counts, and how you'd go about reclassifying block hotness.

yes -- this is more of a problem for Sample based FDO because the
samples are collected after such transformations. We have a paper
discussing that (to appear cgo2016).

thanks,

Dvaid

>
>
> -----Original Message-----
> From: Xinliang David Li [mailto:davidxl at google.com]
> Sent: Wednesday, December 16, 2015 9:47 PM
> To: Andy Ayers <andya at microsoft.com>
> Cc: Easwaran Raman <eraman at google.com>; llvm-dev at lists.llvm.org
> Subject: Re: [llvm-dev] RFC: Hotness thresholds in profile header
>
> On Wed, Dec 16, 2015 at 6:00 PM, Andy Ayers <andya at microsoft.com> wrote:
>> I’ve done similar rankings with profile data and found it worked out
>> pretty well.
>>
>>
>>
>> In my case it was ranking function hotness but the idea was similar:
>> sort by weight, compute various percentile cutoff points, and use those to classify.
>> I put in some compensation for truly degenerate cases. Consider one
>> super-hot function that is 100x the weight of all remaining functions
>> – I’d just call that function hot and subtract it out of the
>> distribution and so spread a bit of the hotness around elsewhere.
>>
>
> This is probably more important if we only consider function entry count (to measure hotness). With Easwaran's proposal, the cut-off computation is based on sum of BB counts (sorted in descending order)
> -- the BB count sum approximate the execution time spent in those BBs.
>   For instance, if 99.9% of execution time is spent in top 10 BBs, the rest BBs are pretty much  irrelevant in terms of performance.
>
> thanks,
> David
>
>
>>
>>
>> Also, you should think about how to handle ties, so your sorting &
>> ranking is “stable” somehow.
>>
>>
>>
>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>> Easwaran Raman via llvm-dev
>> Sent: Wednesday, December 16, 2015 2:43 PM
>> To: llvm-dev at lists.llvm.org
>> Cc: David Li <davidxl at google.com>
>> Subject: [llvm-dev] RFC: Hotness thresholds in profile header
>>
>>
>>
>> Hi,
>>
>> The problem we're trying to address:
>>
>> PGO transforms that are based on block counts need some way to answer
>> isHotBlock() query. A simple comparison of the form block_count > N,
>> for some constant N is not useful since even for the same program
>> different profile inputs can cause different answers for isHotBlock()
>> on the same block. This can be addressed by comparing against a
>> reference value that depends on the program and its input. For
>> instance, the indexed profile header contains MaxFunctionCount and
>> isHotBlock() could check if the block count exceeds a certain fraction of MaxFunctionCount.
>>
>>
>>
>> The drawback of this approach is that it is difficult to come up with
>> a single fraction that works both for programs with a few hot-spots
>> and programs with a long tail of block execution counts. Consider the
>> following two profiles both with a MaxFunctionCount of 1000. In the
>> first case, there are many blocks with count 100 and those (along with
>> the function with count
>> 1000) account for 99% of execution time. In the second case, there are
>> many blocks with count 1000 that together account for 99% of execution
>> time, but there are also a non-trivial number of blocks with count
>> 100. In the first case, we would like the hotness threshold to be 100
>> (or 10% of MaxFunctionCount), but in the second case, we want it to be
>> 1000 (100% of MaxFunctionCount).
>>
>>
>> Our proposal:
>>
>> We want to define hot blocks as those that account for a large
>> percentage(P) of execution time of the program. We then make a
>> simplistic assumption that all blocks take the same amount of time to
>> execute. With this assumption, the execution time of the program can
>> be approximated by the sum of execution counts of the blocks in the
>> program. We can then compute a maximum threshold T(P) such that
>> execution counts of blocks whose count is >= T(P) account for P percentage of total block count.
>>
>> To illustrate, let {B1...B5} be the set of blocks whose counts are {1,
>> 0, 2, 2, 5}. For this example:
>>
>> T(50) = 5 since {B5} accounts for 50% of total block count.
>> T(90) = 2 since {B3, B4, B5} has a cumulative count of 9 which is 90%
>> of total block count.
>> T(95) = 1 since all blocks with non-zero block counts are needed to
>> account for >=95% of total block count.
>>
>> We propose to record a vector of triples {P, T(P), B(P)}, where B(P)
>> is the number of blocks with count >= T(P), for a set of pre-defined
>> number of values of P into the header of the indexed profile. A
>> possible set of values of P are 90, 95, 99, 99.9, 99.99, but this
>> should be configurable through profdata tool.  The B(P) above could be
>> useful to optimization passes in making size/performance tradeoffs.
>> For example, if S is the average block size and S*B(95) fits well
>> within cache, but S*B(99) well exceeds the cache size, an optimization
>> like loop unroller could choose to use T(95) as the threshold to avoid heavy cache misses.
>>
>> The {P, T(P), B(P)} triples have to be attached to the module. A
>> common API would be added to retrieve the set of available Ps and
>> {T(P), B(P)} pair for a given P for all flavors (front-end, IR level and sample) of profiles.
>>
>> Thoughts?
>>
>>
>>
>> Thanks,
>>
>> Easwaran (with inputs from Teresa Johnson and David Li)