[llvm-dev] RFC: Hotness thresholds in profile header

Wed Dec 16 21:47:14 PST 2015

On Wed, Dec 16, 2015 at 6:00 PM, Andy Ayers <andya at microsoft.com> wrote:
> I’ve done similar rankings with profile data and found it worked out pretty
> well.
>
>
>
> In my case it was ranking function hotness but the idea was similar: sort by
> weight, compute various percentile cutoff points, and use those to classify.
> I put in some compensation for truly degenerate cases. Consider one
> super-hot function that is 100x the weight of all remaining functions – I’d
> just call that function hot and subtract it out of the distribution and so
> spread a bit of the hotness around elsewhere.
>

This is probably more important if we only consider function entry
count (to measure hotness). With Easwaran's proposal, the cut-off
computation is based on sum of BB counts (sorted in descending order)
-- the BB count sum approximate the execution time spent in those BBs.
  For instance, if 99.9% of execution time is spent in top 10 BBs, the
rest BBs are pretty much  irrelevant in terms of performance.

thanks,
David


>
>
> Also, you should think about how to handle ties, so your sorting & ranking
> is “stable” somehow.
>
>
>
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> Easwaran Raman via llvm-dev
> Sent: Wednesday, December 16, 2015 2:43 PM
> To: llvm-dev at lists.llvm.org
> Cc: David Li <davidxl at google.com>
> Subject: [llvm-dev] RFC: Hotness thresholds in profile header
>
>
>
> Hi,
>
> The problem we're trying to address:
>
> PGO transforms that are based on block counts need some way to answer
> isHotBlock() query. A simple comparison of the form block_count > N, for
> some constant N is not useful since even for the same program different
> profile inputs can cause different answers for isHotBlock() on the same
> block. This can be addressed by comparing against a reference value that
> depends on the program and its input. For instance, the indexed profile
> header contains MaxFunctionCount and isHotBlock() could check if the block
> count exceeds a certain fraction of MaxFunctionCount.
>
>
>
> The drawback of this approach is that it is difficult to come up with a
> single fraction that works both for programs with a few hot-spots and
> programs with a long tail of block execution counts. Consider the following
> two profiles both with a MaxFunctionCount of 1000. In the first case, there
> are many blocks with count 100 and those (along with the function with count
> 1000) account for 99% of execution time. In the second case, there are many
> blocks with count 1000 that together account for 99% of execution time, but
> there are also a non-trivial number of blocks with count 100. In the first
> case, we would like the hotness threshold to be 100 (or 10% of
> MaxFunctionCount), but in the second case, we want it to be 1000 (100% of
> MaxFunctionCount).
>
>
> Our proposal:
>
> We want to define hot blocks as those that account for a large percentage(P)
> of execution time of the program. We then make a simplistic assumption that
> all blocks take the same amount of time to execute. With this assumption,
> the execution time of the program can be approximated by the sum of
> execution counts of the blocks in the program. We can then compute a maximum
> threshold T(P) such that execution counts of blocks whose count is >= T(P)
> account for P percentage of total block count.
>
> To illustrate, let {B1...B5} be the set of blocks whose counts are {1, 0, 2,
> 2, 5}. For this example:
>
> T(50) = 5 since {B5} accounts for 50% of total block count.
> T(90) = 2 since {B3, B4, B5} has a cumulative count of 9 which is 90% of
> total block count.
> T(95) = 1 since all blocks with non-zero block counts are needed to account
> for >=95% of total block count.
>
> We propose to record a vector of triples {P, T(P), B(P)}, where B(P) is the
> number of blocks with count >= T(P), for a set of pre-defined number of
> values of P into the header of the indexed profile. A possible set of values
> of P are 90, 95, 99, 99.9, 99.99, but this should be configurable through
> profdata tool.  The B(P) above could be useful to optimization passes in
> making size/performance tradeoffs. For example, if S is the average block
> size and S*B(95) fits well within cache, but S*B(99) well exceeds the cache
> size, an optimization like loop unroller could choose to use T(95) as the
> threshold to avoid heavy cache misses.
>
> The {P, T(P), B(P)} triples have to be attached to the module. A common API
> would be added to retrieve the set of available Ps and {T(P), B(P)} pair for
> a given P for all flavors (front-end, IR level and sample) of profiles.
>
> Thoughts?
>
>
>
> Thanks,
>
> Easwaran (with inputs from Teresa Johnson and David Li)