[LLVMdev] [RFC] BlockFrequency is the wrong metric; we need a new one

Sun Feb 2 18:18:12 PST 2014

On Feb 2, 2014, at 2:13 AM, Chandler Carruth <chandlerc at gmail.com> wrote:

> Right now, all profile information is funneled through two analysis passes prior to any part of the optimizer using it.
> 
> First, we have BranchProbabilityInfo, which provides a simple interface to the simplest form of profile information: local and relative branch probabilities. These merely express the likelihood of taking one of a mutually exclusive set of exit paths from a basic block. They are very simple, and the foundation of the profile information. Even the other analysis is merely built on top of this one.
> 
> Second we have BlockFrequencyInfo which attempts to provide a more "global" (function-wide, not actually program wide) view of the statistical frequency with which any particular basic block is executed. This is nicely principled analysis that just computes the probabilistic flow of control through the various branches according to their probabilities established in the first analysis.
> 
> 
> However, I think that BlockFrequencyInfo provides the wrong set of information. There is one critical reason why. Let's take a totally uninteresting CFG:
> 
> A -> B1, B2
> B1 -> C1, C2
> B2 -> C3, C4
> C1 -> D1, D2
> C2 -> ret
> C3 -> D3, D4
> C4 -> ret
> D1 -> E1, E2
> D2 -> ret
> D3 -> E3, E4
> D4 -> ret
> 
> You can imagine this repeating on for as many levels as you like. This isn't an uncommon situation with real code. BlockFrequencyInfo computes for this a very logical answer:
> 
> ---- Block Freqs ----
>  a = 1.0
>   a -> b1 = 0.5
>   a -> b2 = 0.5
>  b1 = 0.5
>   b1 -> c1 = 0.25
>   b1 -> c2 = 0.25
>  b2 = 0.5
>   b2 -> c3 = 0.25
>   b2 -> c4 = 0.25
>  c1 = 0.25
>   c1 -> d1 = 0.125
>   c1 -> d2 = 0.125
>  c2 = 0.25
>  c3 = 0.25
>   c3 -> d3 = 0.125
>   c3 -> d4 = 0.125
>  c4 = 0.25
>  d1 = 0.125
>   d1 -> e1 = 0.0625
>   d1 -> e2 = 0.0625
>  d2 = 0.125
>  d3 = 0.125
>   d3 -> e3 = 0.0625
>   d3 -> e4 = 0.0625
>  d4 = 0.125
> 
> One way of thinking about this is that for any basic block X which is predicated by N branches of unbiased probability (50/50), the frequency computed for that block is 2^(-N).
> 
> The problem is that this doesn't represent anything close to reality. Processors' branch prediction works precisely because very, *very* few branches in programs are 50/50. Most programs do not systematically explore breadth first the full diversity of paths through the program. And yet, in the absense of better information our heuristics would lead us to believe (and act!) as though this were true.
> 
> Now, I'm not saying that the computation of block frequencies is wrong. Merely that it cannot possibly be used for at least one of it purposes -- it's relative frequency (to the entry block "basis" frequency) is completely useless for detecting hot or cold regions of a function -- it will simply claim that all regions of the function are cold. What it *is* useful for is establishing a total ordering over the basic blocks of a function. So it works well for some things like code layout, but is grossly misleading for others.
> 
> 
> There are several possible solutions here. I'll outline my proposal as well as some other ideas.
> 
> 
> BlockWeights instead of BlockFrequencies. My idea is that we don't really care about the depth of the control dependence for a particular basic block. We care about the accumulated *bias* toward or away from a basic block. This is predicated on the idea that branches are overwhelmingly predictable. As a consequence, evenly distributed probabilities are really just *uncertainties*.

This sounds like an interesting way to merge unkown branch probabilities, static heuristics, statistical profiles, and partial profiles.

> My proposed way of implementing this would take the exact algorithm already used in BlockFrequency, but instead of computing a frequency for an edge based on the probability of the branch times the frequency of the predecessor, instead compute it as the frequency of the predecessor times how "biased" the branch is relative to other branches in the predecessor. Essentially, this would make a branch probability set of {0.5, 0.5} produce edge frequencies equal to the predecessor's edge frequency.
> 
> The result of such a system would produce weights for every block in the above CFG as '1.0', or equivalent to the entry block weight. This to me is a really useful metric -- it indicates that no block in the CFG is really more or less likely than any other. Only *biases* in a specific direction would cause the block weights to go up or down significantly.

I don't like this statement, or don't understand it. It is useful to know a branch is unbiased. Currently we assume branches are unbiased then optimize conservatively in those cases (do no harm). But if we had greater confidence in unbiased branches (because the branch was actually profiled), we could if-convert much more aggressively.

> This would immediately make the analysis useful to consumers such as the vectorizer, unroller, or when we have the capability, the inliner and an outliner which respect cold regions of functions.

I'm a little concerned that you're adding complexity that may not be necessary. Algorithms like if-conversion make use of the fact that the block weights are additive. I'm sure they could be adapted to something more sophisticated, but these heuristics are notoriously hard to tune. There is value in a simple model. You say that your approach is simple, but I would have to think hard about handling block frequencies that are inconsistent with their immediate branch probabilities.

So the question is: why do you really need more sophistication?

Part of you problem is that you're looking at frequency relative to the entry block when you should be looking at frequency relative to the region being optimized. When deciding to unroll, compare the loop header with the preheader.

I think you're trying to avoid unrolling even high trip count loops that are  seldom reached, and you're using a non-negligible frequency for "cold" code (e.g. > 2^-10). In that case it sounds like we need a static heuristic to handle the pattern of chains of branches with no CFG merge. The fall-thru path should be > 0.5.

So, I think this is a cool idea. If I better understood how it worked I might be less concerned about the complexity. I'm not very convinced that it is necessary yet.

-Andy

> One alternative proposed by Nick was using the average of block frequencies across a domtree as the denominator to which a particular block's frequency is compared. I haven't really thought as much about this because it seems quite expensive to compute and is less intuitive to me, but I wanted to mention it. Nick might be able to provide an explanation of it that would help folks.
> 
> Related to the idea of using the domtree, I can think of many layers on top of the existing block frequencies that could be used to accurately do "cold region" detection, but they would all be significantly more analysis on top of the existing system and so are somewhat less appealing to me.
> 
> Perhaps others have still more ideas? Also comments or suggestions on mine are very welcome. 
> 
> I have a really simple patch that implements my suggestion using the existing frequency infrastructure (as it is almost exactly the same). I would also want to rename everything to avoid confusion as these would no longer be "frequencies" in any strict sense. I can provide test data as needed on the result of this change if people like the direction.
> 
> -Chandler