[LLVMdev] [RFC] BlockFrequency is the wrong metric; we need a new one

Mon Feb 3 00:13:29 PST 2014

On Feb 2, 2014, at 6:55 PM, Chandler Carruth <chandlerc at gmail.com> wrote:

> On Sun, Feb 2, 2014 at 6:18 PM, Andrew Trick <atrick at apple.com> wrote:
> 
> On Feb 2, 2014, at 2:13 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
> 
> > Right now, all profile information is funneled through two analysis passes prior to any part of the optimizer using it.
> >
> > First, we have BranchProbabilityInfo, which provides a simple interface to the simplest form of profile information: local and relative branch probabilities. These merely express the likelihood of taking one of a mutually exclusive set of exit paths from a basic block. They are very simple, and the foundation of the profile information. Even the other analysis is merely built on top of this one.
> >
> > Second we have BlockFrequencyInfo which attempts to provide a more "global" (function-wide, not actually program wide) view of the statistical frequency with which any particular basic block is executed. This is nicely principled analysis that just computes the probabilistic flow of control through the various branches according to their probabilities established in the first analysis.
> >
> >
> > However, I think that BlockFrequencyInfo provides the wrong set of information. There is one critical reason why. Let's take a totally uninteresting CFG:
> >
> > A -> B1, B2
> > B1 -> C1, C2
> > B2 -> C3, C4
> > C1 -> D1, D2
> > C2 -> ret
> > C3 -> D3, D4
> > C4 -> ret
> > D1 -> E1, E2
> > D2 -> ret
> > D3 -> E3, E4
> > D4 -> ret
> >
> > You can imagine this repeating on for as many levels as you like. This isn't an uncommon situation with real code. BlockFrequencyInfo computes for this a very logical answer:
> >
> > ---- Block Freqs ----
> >  a = 1.0
> >   a -> b1 = 0.5
> >   a -> b2 = 0.5
> >  b1 = 0.5
> >   b1 -> c1 = 0.25
> >   b1 -> c2 = 0.25
> >  b2 = 0.5
> >   b2 -> c3 = 0.25
> >   b2 -> c4 = 0.25
> >  c1 = 0.25
> >   c1 -> d1 = 0.125
> >   c1 -> d2 = 0.125
> >  c2 = 0.25
> >  c3 = 0.25
> >   c3 -> d3 = 0.125
> >   c3 -> d4 = 0.125
> >  c4 = 0.25
> >  d1 = 0.125
> >   d1 -> e1 = 0.0625
> >   d1 -> e2 = 0.0625
> >  d2 = 0.125
> >  d3 = 0.125
> >   d3 -> e3 = 0.0625
> >   d3 -> e4 = 0.0625
> >  d4 = 0.125
> >
> > One way of thinking about this is that for any basic block X which is predicated by N branches of unbiased probability (50/50), the frequency computed for that block is 2^(-N).
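
To make the 2^(-N) figure concrete, here is a minimal Python sketch (not LLVM code) that pushes flow through the example CFG with every branch split 50/50. The `CFG` dict and the `block_frequencies` helper are names made up for illustration, and the loop assumes the dict is listed in topological order.

```python
# Successor lists for the example CFG; blocks that are not keys simply return.
# Assumption: keys are listed in topological order (true for this tree).
CFG = {
    "A": ["B1", "B2"],
    "B1": ["C1", "C2"],
    "B2": ["C3", "C4"],
    "C1": ["D1", "D2"],
    "C3": ["D3", "D4"],
    "D1": ["E1", "E2"],
    "D3": ["E3", "E4"],
}


def block_frequencies(cfg, entry):
    """Split each block's frequency evenly across its successors."""
    freq = {entry: 1.0}
    for block, succs in cfg.items():
        for succ in succs:
            # Edge frequency = predecessor frequency * branch probability.
            freq[succ] = freq.get(succ, 0.0) + freq[block] / len(succs)
    return freq


freq = block_frequencies(CFG, "A")
for name in ("A", "B1", "C1", "D1", "E1"):
    print(name, freq[name])
```

E1 sits behind four unbiased branches, so it comes out at 2^(-4) = 0.0625, matching the table above.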
> >
> > The problem is that this doesn't represent anything close to reality. Processors' branch prediction works precisely because very, *very* few branches in programs are 50/50. Most programs do not systematically explore breadth first the full diversity of paths through the program. And yet, in the absence of better information our heuristics would lead us to believe (and act!) as though this were true.
> >
> > Now, I'm not saying that the computation of block frequencies is wrong. Merely that it cannot possibly be used for at least one of its purposes -- its relative frequency (to the entry block "basis" frequency) is completely useless for detecting hot or cold regions of a function -- it will simply claim that all regions of the function are cold. What it *is* useful for is establishing a total ordering over the basic blocks of a function. So it works well for some things like code layout, but is grossly misleading for others.
> >
> >
> > There are several possible solutions here. I'll outline my proposal as well as some other ideas.
> >
> >
> > BlockWeights instead of BlockFrequencies. My idea is that we don't really care about the depth of the control dependence for a particular basic block. We care about the accumulated *bias* toward or away from a basic block. This is predicated on the idea that branches are overwhelmingly predictable. As a consequence, evenly distributed probabilities are really just *uncertainties*.
> 
> This sounds like an interesting way to merge unknown branch probabilities, static heuristics, statistical profiles, and partial profiles.
> 
> > My proposed way of implementing this would take the exact algorithm already used in BlockFrequency, but rather than computing an edge frequency as the probability of the branch times the frequency of the predecessor, it would compute it as the frequency of the predecessor times how "biased" the branch is relative to the other branches out of the predecessor. Essentially, this would make a branch probability set of {0.5, 0.5} produce edge frequencies equal to the predecessor's frequency.
> >
> > The result of such a system would produce weights for every block in the above CFG as '1.0', or equivalent to the entry block weight. This to me is a really useful metric -- it indicates that no block in the CFG is really more or less likely than any other. Only *biases* in a specific direction would cause the block weights to go up or down significantly.
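
One possible reading of the proposal, sketched in Python under loud assumptions: "bias" is taken to be the branch probability divided by the mean probability over the block's successors, and only the tree-shaped CFG from above is handled (merge points and loops are left out, since the message does not spell out how they would combine). The names `block_weights` and `probs` are hypothetical.

```python
# Same example CFG as in the message, keys in topological order.
CFG = {
    "A": ["B1", "B2"],
    "B1": ["C1", "C2"],
    "B2": ["C3", "C4"],
    "C1": ["D1", "D2"],
    "C3": ["D3", "D4"],
    "D1": ["E1", "E2"],
    "D3": ["E3", "E4"],
}


def block_weights(cfg, probs, entry):
    """Scale each successor by how far its probability sits from the mean.

    Assumption: bias = p / mean(p); edges missing from `probs` are unbiased.
    Only valid for a tree-shaped CFG (every block has one predecessor).
    """
    weight = {entry: 1.0}
    for block, succs in cfg.items():
        mean = 1.0 / len(succs)
        for succ in succs:
            p = probs.get((block, succ), mean)
            weight[succ] = weight[block] * (p / mean)
    return weight


unbiased = block_weights(CFG, {}, "A")
biased = block_weights(CFG, {("A", "B1"): 0.75, ("A", "B2"): 0.25}, "A")
print(unbiased["E1"], biased["E1"], biased["E3"])
```

With no known bias every block keeps the entry weight of 1.0; a single 0.75/0.25 split at A lifts everything under B1 to 1.5 and drops everything under B2 to 0.5.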
> 
> I don't like this statement, or don't understand it. It is useful to know a branch is unbiased. Currently we assume branches are unbiased then optimize conservatively in those cases (do no harm). But if we had greater confidence in unbiased branches (because the branch was actually profiled), we could if-convert much more aggressively.
> 
> I think the key is that I'm not talking about changing how we model branches at all. I agree that it would be useful to model both our confidence in the relative probabilities and the probabilities themselves for things like if-conversion. We can still do that.
> 
> But the current users of 'block frequency' are not trying to understand a specific branch as biased or un-biased, or deal with our confidence. They're just trying to understand how "important" a specific basic block is to the execution of the function.

Yes, I understand that. My concern was that passes may assume frequency/weight is consistent with probability (path weights add up to their merge/join points). At least, I'm used to making that assumption. Take a simple CFG:

A
|\
B C
| |\
D E F
| |/
|/
G

Suppose each basic block has cost=1. Path A,B,D,G has weighted cost=4. Path A,C,E|F,G has weighted cost=5. But they should be equal.
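
A small Python sketch of that arithmetic, assuming unbiased branches throughout. The helper names are made up, and the "unit" weights assume every block ends up at 1.0 under the proposal, which is what the figures 4 and 5 imply.

```python
# The diamond CFG drawn above, keys listed in topological order.
CFG = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E", "F"],
    "D": ["G"],
    "E": ["G"],
    "F": ["G"],
    "G": [],
}


def flow_frequencies(cfg, entry):
    """Current model: split flow evenly at branches, add it up at merges."""
    freq = {entry: 1.0}
    for block, succs in cfg.items():
        for succ in succs:
            freq[succ] = freq.get(succ, 0.0) + freq[block] / len(succs)
    return freq


def weighted_cost(blocks, weight, cost=1.0):
    """Sum of per-block cost scaled by that block's weight."""
    return sum(weight[b] * cost for b in blocks)


freq = flow_frequencies(CFG, "A")
unit = {b: 1.0 for b in CFG}  # assumed outcome of the proposed weights

left = ["A", "B", "D", "G"]
right = ["A", "C", "E", "F", "G"]
print(weighted_cost(left, freq), weighted_cost(right, freq))
print(weighted_cost(left, unit), weighted_cost(right, unit))
```

Under flow frequencies both sides cost 3.0 because E and F share C's flow; with every block at 1.0 the sides cost 4.0 and 5.0.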

>  
> > This would immediately make the analysis useful to consumers such as the vectorizer, unroller, or when we have the capability, the inliner and an outliner which respect cold regions of functions.
> 
> I'm a little concerned that you're adding complexity that may not be necessary. Algorithms like if-conversion make use of the fact that the block weights are additive.
> 
> Are you talking about branch probabilities here? I don't believe we use block frequencies in anything other than:
> 1) MachineBlockPlacement
> 2) Spill cost, placement, and generally the register allocator
> 3) StackSlotColoring (but really only for spill weight computation)
> 4) LoopVectorizer (disabled currently)
> 
> I just want to make sure we're on the same page. I don't think it really invalidates your concern.

You're right. I thought early if-conversion was multiplying the cost of each path by its weight. We aren't doing that. Instead we're taking a do-no-harm approach of estimating critical path increase without considering probability. It's actually more consistent with your approach.

> I'm sure they could be adapted to something more sophisticated, but these heuristics are notoriously hard to tune. There is value in a simple model. You say that your approach is simple, but I would have to think hard about handling block frequencies that are inconsistent with their immediate branch probabilities.
> 
> If they were inconsistent, I would agree. But they are in fact consistent and additive. They even preserve information during CFG merge points. The only difference is that they scale (both up and down) relative to the bias of a branch (probability's delta from the mean) rather than relative to the flat probability.
> 
> So I don't think this makes the computation, propagation or mathematical model more complex. What it does is shift it from a flow frequency modeling to biased branch modeling. That is certainly a change and worth justifying, I just don't want you to imagine it as having more impact than it does.
> 
> So the question is: why do you really need more sophistication?
> 
> Part of your problem is that you're looking at frequency relative to the entry block when you should be looking at frequency relative to the region being optimized. When deciding to unroll, compare the loop header with the preheader.
> 
> Actually, I did that already. ;] I really thought that would work back when we were first discussing how to use block frequencies and have not forgotten it. However, that doesn't actually capture the frequency relative to the region. I think an example is best. Let's imagine that in my CFG above "E1" is a loop preheader. There is some loop header below it in the CFG that I hadn't reached yet.
> 
> The first thing to realize is that the heuristic we want is not how hot (or cold) the loop header is relative to the preheader. That tells us, once we execute this loop, how many times we are likely to take the backedge (or for a nested loop, one of the backedges). But this number may correctly be arbitrarily large if we have a loop in a cold region which when executed loops for a very long time. What we actually want to check for is how frequently we enter the loop from outside the loop at all.
> 
> So the test is how "hot" is the loop preheader relative to its "region". Now, if the loop preheader is nested inside of another loop, we might usefully compare the inner loop's preheader to the outer loop's header and get a useful metric of "hot" or "cold". But what happens when we are trying to test for this in the outermost loop? We need some way to define a "region" of the function body itself then. There are a bunch of viable definitions. The one which maps to the loop header for a loop nest is the function entry block. However, that is the precise circumstance that doesn't work -- in a branchy program, most blocks are "cold" relative to the function entry if we simply trust the block frequency.
> 
> Now, there are other ways to define a "region". I mentioned one as an alternative to my proposal -- we could look at block frequencies only relative to the root of their domtree. Another would be to use SESE regions, or any number of others. However, they all have the problem that we must do substantial work and add substantial complexity and sophistication to compute the region, the frequency for that region, and then compare ours against it.
> 
> As a practical matter, they also have the problem of accumulating significant error rapidly due to continual division of the entry frequency. My proposal actually mitigates such problems significantly.
> 
> 
> I think you're trying to avoid unrolling even high trip count loops that are seldom reached, and you're using a non-negligible frequency for "cold" code (e.g. > 2^-10). In that case it sounds like we need a static heuristic to handle the pattern of chains of branches with no CFG merge. The fall-thru path should be > 0.5.
> 
> I'm really not looking at a single problem or solution. There are a wide variety of consumers of profile information that want to specially handle both cold and hot regions of a function. Looking at the frequencies as they exist today, these transformations cannot conclude either cold or hot for a region of code from them. We need something better.
> 
> If we had any static heuristic that was good at predicting *which* long chain should be >0.5 probability, we would be using it already. The problem is that there is no single idea of "fallthrough". We canonicalize the branches, and even programmers cannot make up their mind. Is the early exit the fast-path or is the early exit the error handling? Both patterns exist widely. If we have raw profile information, then yes, this is *much* less of a problem because we will actually know what direction all of the predictable branches in the program actually take. But our analyses currently are deceptive in the absence of strong profile information by giving the impression that the resulting branches are *unpredictable* instead of being predictable and unknown.
> 
> So, I think this is a cool idea. If I better understood how it worked I might be less concerned about the complexity. I'm not very convinced that it is necessary yet.
> 
> Ok, for the necessity, simply take any function from a simple benchmark, and look at its static block frequencies. Arnold posted a great example from libquantum. Here we have the most boring code (1 loop in a tiny basic block) and in a boring benchmark that is completely predictable. Yet still, we think that the loop is reached less than 20% of the times that the function is called. That just doesn't make sense.

That makes a fair amount of sense to me, but I see your point that it’s not particularly useful.

I think there are three sets of facts we’re trying to ascertain from block frequencies:

(1) Are blocks expected to be executed more or less frequently than their neighbors? This is effectively a way to summarize the CFG with a single block frequency number. Consider a global code motion pass. If it is guided by block frequencies, it doesn't need to know about loops or CFG regions. It always places instructions outside loops and sinks them below branches and above merges. Spill placement is similar.

(2) Does a loop iterate enough times to justify loop versioning or unrolling?

(3) Which blocks are "cold"? In other words, identify blocks that are almost never reached and not worth optimizing if it means increasing code size and possibly compile time.

The proposed change makes (3) easier in case of deep acyclic control flow. It allows us to more aggressively prune the optimization space.

(2) is not currently a problem. But with the new approach we could not simply estimate trip count as freq(header) / freq(preheader).
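
For reference, a minimal Python sketch of that estimate under the current flow model, assuming a single-backedge loop and the usual geometric-series treatment of the backedge probability. The function names and the numbers are illustrative only.

```python
def header_frequency(preheader_freq, backedge_prob):
    """Each entry executes the header 1 + p + p^2 + ... = 1 / (1 - p) times."""
    return preheader_freq / (1.0 - backedge_prob)


def estimated_trip_count(header_freq, preheader_freq):
    """The estimate from (2): header frequency relative to the preheader."""
    return header_freq / preheader_freq


# A loop entered directly from the function entry block.
hot = header_frequency(1.0, 0.75)
# The same loop sitting behind four unbiased branches (preheader at 2^-4).
cold = header_frequency(0.0625, 0.75)

print(estimated_trip_count(hot, 1.0))
print(estimated_trip_count(cold, 0.0625))
```

Both calls give 4.0: the ratio reflects only the backedge probability, not how often the loop is entered. That holds only while header and preheader frequencies are related by plain flow.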

Regarding (1). You've probably already considered any impact on block layout, so I'll ignore that. For code motion/placement, I'm also not sure about your method of "scaling upward" within acyclic control flow. I'm used to thinking that if a block has higher weight it must be more frequently executed. So if I place code on the lightest block, I am either (a) hoisting it outside a loop or (b) sinking it below branches or above merges.

With a new approach where we scale for a branch bias, if you have a block with frequency 1.0 terminated by a branch with 0.25/0.75 probability, regardless of whether it's profiled or subject to static heuristic, I could imagine giving its successors frequency 0.33 and 1.0. I think what you’re proposing is to give the successors frequency 0.5 and 1.5. That’s weird to me because I don’t know how we would identify loops from the frequency info.
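
The two scalings can be written down as a short Python sketch. Both rules are guesses at what "scaling for bias" could mean (relative to the most likely successor versus relative to the mean), and the function names are hypothetical.

```python
def scale_by_max(probs):
    """Normalize so the most likely successor keeps the predecessor's weight."""
    top = max(probs)
    return [p / top for p in probs]


def scale_by_mean(probs):
    """Normalize so an unbiased successor keeps the predecessor's weight."""
    n = len(probs)
    return [p * n for p in probs]


branch = [0.25, 0.75]
print(scale_by_max(branch))   # roughly 0.33 and 1.0
print(scale_by_mean(branch))  # 0.5 and 1.5
```

Both rules leave a {0.5, 0.5} branch at 1.0 on each side, so they differ only once a bias is present. The mean-relative rule is the one that pushes a successor above its predecessor.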

Also, I honestly don’t see how the block frequency analysis will work. It’s not clear to me how you can compute cyclic probability for the loops if you’ve been scaling frequencies along the way.

-Andy

> If you want to take a more complex approach, we could definitely add a confidence dimension to the branch probability. This would allow us to express a high-confidence un-predictable branch and optimize based on that. However, we would *still* want block weights to be computed more like I have proposed. Instead of only branch probabilities that were far outside of 50/50 influencing it, we would want branch probabilities that were highly confident to influence it. The practical result would be exactly the same as what I have proposed, it is just that I'm making the simplifying assumption that at the moment, there are no evenly distributed branch probabilities which we have high confidence in.
> 
> My suspicion is that we will find even with direct profile information, this will still be true. Mostly, I suspect there are relatively few cases where it is important we model an even distribution of branches with high probability. I'm happy to be wrong about this, but *that* is where I want to wait for more data to make a decision that adds complexity to the system.
