[llvm-dev] RFC: Hotness thresholds in profile header

Fri Jan 8 12:01:06 PST 2016

Can you submit the patch for review?

David

On Fri, Jan 8, 2016 at 11:59 AM, Easwaran Raman <eraman at google.com> wrote:
> I have a patch at http://reviews.llvm.org/D16005 that implements this
> feature as part of 'show' command of the llvm-profdata. I think it'll be
> useful to implement this as part of show so that it can be tried on profiles
> of programs that people care about and refine it before getting this
> information into the header of the indexed profile.
>
> On Thu, Dec 17, 2015 at 11:17 AM, Xinliang David Li <davidxl at google.com>
> wrote:
>>
>> On Thu, Dec 17, 2015 at 9:21 AM, Andy Ayers <andya at microsoft.com> wrote:
>> > While your bb count distribution is extremely likely to be some kind of
>> > power-law like distribution, it's not guaranteed.
>> >
>> > Also you might think about operations that can amplify (rerolling) or
>> > appear to amplify (TRE) or diminish BB counts, and how you'd go about
>> > reclassifying block hotness.
>>
>> yes -- this is more of a problem for Sample based FDO because the
>> samples are collected after such transformations. We have a paper
>> discussing that (to appear cgo2016).
>>
>> thanks,
>>
>> Dvaid
>>
>> >
>> >
>> > -----Original Message-----
>> > From: Xinliang David Li [mailto:davidxl at google.com]
>> > Sent: Wednesday, December 16, 2015 9:47 PM
>> > To: Andy Ayers <andya at microsoft.com>
>> > Cc: Easwaran Raman <eraman at google.com>; llvm-dev at lists.llvm.org
>> > Subject: Re: [llvm-dev] RFC: Hotness thresholds in profile header
>> >
>> > On Wed, Dec 16, 2015 at 6:00 PM, Andy Ayers <andya at microsoft.com> wrote:
>> >> I’ve done similar rankings with profile data and found it worked out
>> >> pretty well.
>> >>
>> >>
>> >>
>> >> In my case it was ranking function hotness but the idea was similar:
>> >> sort by weight, compute various percentile cutoff points, and use those
>> >> to classify.
>> >> I put in some compensation for truly degenerate cases. Consider one
>> >> super-hot function that is 100x the weight of all remaining functions
>> >> – I’d just call that function hot and subtract it out of the
>> >> distribution and so spread a bit of the hotness around elsewhere.
>> >>
>> >
>> > This is probably more important if we only consider function entry count
>> > (to measure hotness). With Easwaran's proposal, the cut-off computation is
>> > based on sum of BB counts (sorted in descending order)
>> > -- the BB count sum approximate the execution time spent in those BBs.
>> >   For instance, if 99.9% of execution time is spent in top 10 BBs, the
>> > rest BBs are pretty much  irrelevant in terms of performance.
>> >
>> > thanks,
>> > David
>> >
>> >
>> >>
>> >>
>> >> Also, you should think about how to handle ties, so your sorting &
>> >> ranking is “stable” somehow.
>> >>
>> >>
>> >>
>> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>> >> Easwaran Raman via llvm-dev
>> >> Sent: Wednesday, December 16, 2015 2:43 PM
>> >> To: llvm-dev at lists.llvm.org
>> >> Cc: David Li <davidxl at google.com>
>> >> Subject: [llvm-dev] RFC: Hotness thresholds in profile header
>> >>
>> >>
>> >>
>> >> Hi,
>> >>
>> >> The problem we're trying to address:
>> >>
>> >> PGO transforms that are based on block counts need some way to answer
>> >> isHotBlock() query. A simple comparison of the form block_count > N,
>> >> for some constant N is not useful since even for the same program
>> >> different profile inputs can cause different answers for isHotBlock()
>> >> on the same block. This can be addressed by comparing against a
>> >> reference value that depends on the program and its input. For
>> >> instance, the indexed profile header contains MaxFunctionCount and
>> >> isHotBlock() could check if the block count exceeds a certain fraction
>> >> of MaxFunctionCount.
>> >>
>> >>
>> >>
>> >> The drawback of this approach is that it is difficult to come up with
>> >> a single fraction that works both for programs with a few hot-spots
>> >> and programs with a long tail of block execution counts. Consider the
>> >> following two profiles both with a MaxFunctionCount of 1000. In the
>> >> first case, there are many blocks with count 100 and those (along with
>> >> the function with count
>> >> 1000) account for 99% of execution time. In the second case, there are
>> >> many blocks with count 1000 that together account for 99% of execution
>> >> time, but there are also a non-trivial number of blocks with count
>> >> 100. In the first case, we would like the hotness threshold to be 100
>> >> (or 10% of MaxFunctionCount), but in the second case, we want it to be
>> >> 1000 (100% of MaxFunctionCount).
>> >>
>> >>
>> >> Our proposal:
>> >>
>> >> We want to define hot blocks as those that account for a large
>> >> percentage(P) of execution time of the program. We then make a
>> >> simplistic assumption that all blocks take the same amount of time to
>> >> execute. With this assumption, the execution time of the program can
>> >> be approximated by the sum of execution counts of the blocks in the
>> >> program. We can then compute a maximum threshold T(P) such that
>> >> execution counts of blocks whose count is >= T(P) account for P
>> >> percentage of total block count.
>> >>
>> >> To illustrate, let {B1...B5} be the set of blocks whose counts are {1,
>> >> 0, 2, 2, 5}. For this example:
>> >>
>> >> T(50) = 5 since {B5} accounts for 50% of total block count.
>> >> T(90) = 2 since {B3, B4, B5} has a cumulative count of 9 which is 90%
>> >> of total block count.
>> >> T(95) = 1 since all blocks with non-zero block counts are needed to
>> >> account for >=95% of total block count.
>> >>
>> >> We propose to record a vector of triples {P, T(P), B(P)}, where B(P)
>> >> is the number of blocks with count >= T(P), for a set of pre-defined
>> >> number of values of P into the header of the indexed profile. A
>> >> possible set of values of P are 90, 95, 99, 99.9, 99.99, but this
>> >> should be configurable through profdata tool.  The B(P) above could be
>> >> useful to optimization passes in making size/performance tradeoffs.
>> >> For example, if S is the average block size and S*B(95) fits well
>> >> within cache, but S*B(99) well exceeds the cache size, an optimization
>> >> like loop unroller could choose to use T(95) as the threshold to avoid
>> >> heavy cache misses.
>> >>
>> >> The {P, T(P), B(P)} triples have to be attached to the module. A
>> >> common API would be added to retrieve the set of available Ps and
>> >> {T(P), B(P)} pair for a given P for all flavors (front-end, IR level
>> >> and sample) of profiles.
>> >>
>> >> Thoughts?
>> >>
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Easwaran (with inputs from Teresa Johnson and David Li)
>
>