[llvm-dev] Why did Intel change its static branch prediction mechanism over the years?
Fabian Giesen via llvm-dev
llvm-dev at lists.llvm.org
Tue Aug 14 16:13:50 PDT 2018
The 486 "assume not taken" strategy (also common in other in-order
cores) is not a branch prediction mechanism as such. It can be
interpreted as one (in that it's essentially equivalent to predicting
all branches not taken), but first and foremost it's simply what happens
when the instruction fetch stage just keeps going through memory
sequentially unless redirected.
In other words, it's what naturally happens in a pipelined processor in
the _absence_ of any intentionally designed branch prediction mechanism.
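Concretely, the next-PC logic of such a front end boils down to
something like the following C sketch (purely illustrative; the
function name and the fetch-width parameter are my own assumptions,
not any real design):

  #include <stdint.h>
  #include <stddef.h>

  /* Illustrative only: next-PC selection in a pipelined front end
   * with no branch predictor. Unless a resolved taken branch
   * redirects it, fetch marches sequentially through memory -- which
   * is exactly the "assume not taken" behavior. */
  uint64_t next_fetch_pc(uint64_t pc, size_t fetch_bytes,
                         int redirect_valid, uint64_t redirect_target)
  {
      if (redirect_valid)        /* taken branch resolved downstream */
          return redirect_target;
      return pc + fetch_bytes;   /* default: keep going sequentially */
  }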
As for static vs. dynamic prediction: to make an explicit static
prediction, you first need to know whether you have seen a given branch
before, and many branch predictors don't! That is, they simply have no
way to tell reliably whether a given branch history (or branch target)
entry actually belongs to the branch they're trying to predict. The
various branch prediction structures are mostly just bags of bits
indexed by a hash code.
Now, some do have a few tag bits that let them detect collisions (at
least with high probability), but others don't. It's just a matter of
whether the architects think the tag bits are worth the area, or
whether spending that area on something else is a better trade-off for
their target criteria.
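To illustrate the "bags of bits" point, here is a toy hash-indexed
direction predictor in C, with the tag bits left optional. The table
size, hash function and tag width are invented for the example, not
taken from any real CPU:

  #include <stdint.h>

  #define PRED_BITS 12
  #define PRED_SIZE (1u << PRED_BITS)

  /* Per-entry state: a 2-bit saturating counter plus, optionally, a
   * small partial tag. Without tags, two branches whose PCs hash to
   * the same index silently share a counter and the predictor cannot
   * tell them apart. */
  static uint8_t counters[PRED_SIZE]; /* 0..3; >= 2 means "taken" */
  static uint8_t tags[PRED_SIZE];     /* optional partial tag bits */

  static uint32_t pred_index(uint64_t pc)
  {
      return (uint32_t)(pc >> 2) & (PRED_SIZE - 1);
  }

  static uint8_t pred_tag(uint64_t pc)
  {
      return (uint8_t)(pc >> (2 + PRED_BITS)); /* next bits of the PC */
  }

  int predict_taken(uint64_t pc)
  {
      uint32_t i = pred_index(pc);
      /* With tags we can detect (most) collisions and fall back to a
       * default guess; without them, we just trust whatever is there. */
      if (tags[i] != pred_tag(pc))
          return 0;              /* likely collision: default not taken */
      return counters[i] >= 2;
  }

  void update_predictor(uint64_t pc, int taken)
  {
      uint32_t i = pred_index(pc);
      tags[i] = pred_tag(pc);    /* claim the entry for this branch */
      if (taken && counters[i] < 3)
          counters[i]++;
      else if (!taken && counters[i] > 0)
          counters[i]--;
  }

Deleting the tags[] array and the tag check gives the untagged flavor:
smaller, but aliasing branches silently pollute each other's counters.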
Netburst (Pentium 4) is an anomaly, AFAIK because of the way the branch
prediction interacts with the trace cache. Namely, I believe that branch
history etc. in Netburst was kept at the trace level. For trace cache
entries, unlike branch history/target entries, there is a well-defined
moment when they're new, namely when they're initialized after a trace
cache miss (trace cache miss detection _has_ to do full tagging and be
precise for the CPU to work); that's an opportunity to set up the new
trace with the static prediction heuristic, and also branch hints. [But
I'm no expert on this, and I bet there are Intel folks on this list that
could be more precise.]
Zooming out a bit, real designs don't generally have a single monolithic
branch prediction mechanism; they have lots of them, in different places
in the pipeline. Going roughly in pipeline order, some common mechanisms
are:
1. Next I$ fetch line prediction. The address for the next instruction
fetch needs to be determined before the current fetch even completes,
for there to be no bubbles.
2. Early branch target prediction (indexed by instruction fetch
address); instruction fetch is usually multiple cycles, and again you
need to know the likely address to fetch next before you've even seen
any instruction bytes or know where branches are!
3. A later, more precise target/direction predictor. The early
predictors are "quick and dirty", and not very accurate. In parallel,
you start working on a more accurate prediction; if the late predictor
disagrees with the early predictor, you need to squash a few cycles
worth of instructions, but this is still fairly early in the pipeline,
so a lot cheaper than full mispredictions.
4. A return stack predictor for function returns, the most common type
of indirect branch by far. This tracks function calls and returns and
supplies what it thinks are likely targets for return statements (a
toy sketch follows after this list).
5. A separate predictor for other indirect branch targets (e.g. through
virtual or jump tables). BTBs can cover some of this but dedicated
indirect predictors work better. Again, an indirect predictor can decide
that an earlier-guessed branch target was bogus and initiate a partial
flush.
6. There are sometimes also special loop predictors to catch counted
loops that (almost) always execute the same number of iterations.
Innermost loop nests are usually covered well by the regular global
branch history, but loop predictors help with outer loop nests that are
periodic on longer time scales than the global branch history captures.
7. Once instructions are decoded, branch target addresses (for direct
branches) are verified. Branch target addresses in the BTBs and other
structures are commonly stored with a reduced number of bits, and in
certain (hopefully rare) cases the "decompressed" address will be
incorrect. Again, if this is detected, a partial flush is initiated;
we're now several cycles downstream from initial instruction fetch, but
this is still far from a full branch misprediction penalty.
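To make item 4 above concrete, here is a toy return-address-stack
predictor in C; the depth, the overflow policy (wrap and overwrite the
oldest entry) and the names are assumptions for the sketch, not a
description of any shipping design:

  #include <stdint.h>

  #define RAS_DEPTH 16  /* real designs are similarly small */

  /* Toy return address stack (RAS): pushed on calls, popped on
   * returns. On overflow it wraps and overwrites the oldest entry, so
   * a deeply nested call chain will mispredict some returns on the
   * way back out. Popping an empty stack just yields an arbitrary
   * (wrong) guess, which the normal misprediction machinery then has
   * to clean up. */
  static uint64_t ras[RAS_DEPTH];
  static unsigned ras_top;  /* next free slot, used modulo RAS_DEPTH */

  void on_call(uint64_t return_address)
  {
      ras[ras_top % RAS_DEPTH] = return_address;
      ras_top++;
  }

  uint64_t predict_return(void)
  {
      ras_top--;  /* speculative pop; a flush must restore this */
      return ras[ras_top % RAS_DEPTH];
  }

The point of the structure is that on_call() runs when a call is
fetched and predict_return() when the matching ret is fetched, many
cycles before the ret executes, so the front end can keep fetching
from the predicted return address.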
CPU vendors are usually close-mouthed about what exactly they're doing
(it's some of their secret sauce), but some are more open than others.
One not-too-old CPU design that has a relatively thorough description of
the branch prediction architecture (though not the nitty-gritty details)
is the AMD Jaguar (low-power x86):
http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip
Section 2.7.1 in the PDF has a (very) brief description of the various
branch prediction mechanisms.
-Fabian
On 8/14/2018 2:09 PM, 2016 quekong via llvm-dev wrote:
> ( I don't know if it's allowed to ask such question, if not, please remind me. )
>
> I know Intel has implemented several static branch prediction
> mechanisms over the years:
> * 80486 era: always not taken
> * Pentium 4 era: backwards taken / forwards not taken
> * Pentium M, Core 2: no static prediction; the outcome depends on
> whatever happens to be in the corresponding BTB entry, according to
> Agner's optimization guide ¹.
> * Newer CPUs like Ivy Bridge and Haswell have become increasingly
> opaque, according to Matt G's experiment ².
>
> And Intel seems not to want to talk about it any more; the latest
> material I could find in Intel's documentation was written about ten
> years ago.
>
> I know static branch prediction is (far?) less important than dynamic
> prediction, but in quite a few situations the CPU will be completely
> lost, and the programmer (with the compiler) is usually the best
> guide. Of course these situations are usually not performance
> bottlenecks, because once a branch is executed frequently, the
> dynamic predictor will capture it.
>
> Since Intel no longer clearly documents the prediction mechanism,
> GCC's __builtin_expect() can do nothing more than remove the unlikely
> branch from the hot path (or the reverse for a likely branch).
>
> I am not familiar with CPU design and I don't know exactly what
> mechanism Intel uses for its static predictor nowadays. I just feel
> the best static mechanism for Intel would be to clearly document, for
> its CPUs, "where I plan to go when the dynamic predictor fails:
> forward or backward", because at that point the programmer is usually
> the best guide.
>
>
> APPENDIX:
> ¹ Agner's optimization guide, section 3.5:
> https://www.agner.org/optimize/microarchitecture.pdf
>
> ² Matt G's experiment: https://xania.org/201602/bpu-part-two