[llvm-dev] Why did Intel change its static branch prediction mechanism over the years?
Fabian Giesen via llvm-dev
llvm-dev at lists.llvm.org
Tue Aug 14 16:13:50 PDT 2018
The 486 "assume not taken" strategy (also common in other in-order
cores) is not a branch prediction mechanism as such. It can be
interpreted as one (in that it's essentially equivalent to predicting
all branches not taken), but first and foremost it's simply what happens
when the instruction fetch stage just keeps going through memory
sequentially unless redirected.
In other words, it's what naturally happens in a pipelined processor in
the _absence_ of any intentionally designed branch prediction mechanism.
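Concretely, the next-PC logic of such a front end boils down to
something like the following C sketch (purely illustrative; the
function name and the fetch-width parameter are my own assumptions,
not any real design):

  #include <stdint.h>
  #include <stddef.h>

  /* Illustrative only: next-PC selection in a pipelined front end
   * with no branch predictor. Unless a resolved taken branch
   * redirects it, fetch marches sequentially through memory -- which
   * is exactly the "assume not taken" behavior. */
  uint64_t next_fetch_pc(uint64_t pc, size_t fetch_bytes,
                         int redirect_valid, uint64_t redirect_target)
  {
      if (redirect_valid)        /* taken branch resolved downstream */
          return redirect_target;
      return pc + fetch_bytes;   /* default: keep going sequentially */
  }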
As for static vs. dynamic prediction: to make an explicit static
prediction, you first need to know whether you have seen a given branch
before, and many branch predictors don't! That is, they simply have no
way to tell reliably whether a given branch history (or branch target)
entry actually belongs to the branch they're trying to predict. The
various branch prediction structures are mostly just bags of bits
indexed by a hash code.
Now, some do have a few tag bits that let them detect collisions (at
least with high probability), but others don't. It's just a matter of
whether the architects think the tag bits are worth the area, or
whether spending that area on something else is a better trade-off for
their target criteria.
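To illustrate the "bags of bits" point, here is a toy hash-indexed
direction predictor in C, with the tag bits left optional. The table
size, hash function and tag width are invented for the example, not
taken from any real CPU:

  #include <stdint.h>

  #define PRED_BITS 12
  #define PRED_SIZE (1u << PRED_BITS)

  /* Per-entry state: a 2-bit saturating counter plus, optionally, a
   * small partial tag. Without tags, two branches whose PCs hash to
   * the same index silently share a counter and the predictor cannot
   * tell them apart. */
  static uint8_t counters[PRED_SIZE]; /* 0..3; >= 2 means "taken" */
  static uint8_t tags[PRED_SIZE];     /* optional partial tag bits */

  static uint32_t pred_index(uint64_t pc)
  {
      return (uint32_t)(pc >> 2) & (PRED_SIZE - 1);
  }

  static uint8_t pred_tag(uint64_t pc)
  {
      return (uint8_t)(pc >> (2 + PRED_BITS)); /* next bits of the PC */
  }

  int predict_taken(uint64_t pc)
  {
      uint32_t i = pred_index(pc);
      /* With tags we can detect (most) collisions and fall back to a
       * default guess; without them, we just trust whatever is there. */
      if (tags[i] != pred_tag(pc))
          return 0;              /* likely collision: default not taken */
      return counters[i] >= 2;
  }

  void update_predictor(uint64_t pc, int taken)
  {
      uint32_t i = pred_index(pc);
      tags[i] = pred_tag(pc);    /* claim the entry for this branch */
      if (taken && counters[i] < 3)
          counters[i]++;
      else if (!taken && counters[i] > 0)
          counters[i]--;
  }

Deleting the tags[] array and the tag check gives the untagged flavor:
smaller, but aliasing branches silently pollute each other's counters.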
Netburst (Pentium 4) is an anomaly, AFAIK because of the way the branch
prediction interacts with the trace cache. Namely, I believe that branch
history etc. in Netburst was kept at the trace level. For trace cache
entries, unlike branch history/target entries, there is a well-defined
moment when they're new, namely when they're initialized after a trace
cache miss (trace cache miss detection _has_ to do full tagging and be
precise for the CPU to work); that's an opportunity to set up the new
trace with the static prediction heuristic, and also branch hints. [But
I'm no expert on this, and I bet there are Intel folks on this list that
could be more precise.]
Zooming out a bit, real designs don't generally have a single monolithic
branch prediction mechanism; they have lots of them, in different places
in the pipeline. Going roughly in pipeline order, some common mechanisms
are:
1. Next I$ fetch line prediction. The address for the next instruction
fetch needs to be determined before the current fetch even completes,
for there to be no bubbles.
2. Early branch target prediction (indexed by instruction fetch
address); instruction fetch is usually multiple cycles, and again you
need to know the likely address to fetch next before you've even seen
any instruction bytes or know where branches are!
3. A later, more precise target/direction predictor. The early
predictors are "quick and dirty", and not very accurate. In parallel,
you start working on a more accurate prediction; if the late predictor
disagrees with the early predictor, you need to squash a few cycles
worth of instructions, but this is still fairly early in the pipeline,
so a lot cheaper than full mispredictions.
4. A return stack predictor for function returns, the most common type
of indirect branch by far. This tracks function calls and returns and
supplies what it thinks are likely targets for return statements (a
toy sketch follows after this list).
5. A separate predictor for other indirect branch targets (e.g. through
virtual or jump tables). BTBs can cover some of this but dedicated
indirect predictors work better. Again, an indirect predictor can decide
that an earlier-guessed branch target was bogus and initiate a partial
flush.
6. There are sometimes also special loop predictors to catch counted
loops that (almost) always execute the same number of iterations.
Innermost loop nests are usually covered well by the regular global
branch history, but loop predictors help with outer loop nests that are
periodic on longer time scales than the global branch history captures.
7. Once instructions are decoded, branch target addresses (for direct
branches) are verified. Branch target addresses in the BTBs and other
structures are commonly stored with a reduced number of bits, and in
certain (hopefully rare) cases the "decompressed" address will be
incorrect. Again, if this is detected, a partial flush is initiated;
we're now several cycles downstream from initial instruction fetch, but
this is still far from a full branch misprediction penalty.
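To make item 4 above concrete, here is a toy return-address-stack
predictor in C; the depth, the overflow policy (wrap and overwrite the
oldest entry) and the names are assumptions for the sketch, not a
description of any shipping design:

  #include <stdint.h>

  #define RAS_DEPTH 16  /* real designs are similarly small */

  /* Toy return address stack (RAS): pushed on calls, popped on
   * returns. On overflow it wraps and overwrites the oldest entry, so
   * a deeply nested call chain will mispredict some returns on the
   * way back out. Popping an empty stack just yields an arbitrary
   * (wrong) guess, which the normal misprediction machinery then has
   * to clean up. */
  static uint64_t ras[RAS_DEPTH];
  static unsigned ras_top;  /* next free slot, used modulo RAS_DEPTH */

  void on_call(uint64_t return_address)
  {
      ras[ras_top % RAS_DEPTH] = return_address;
      ras_top++;
  }

  uint64_t predict_return(void)
  {
      ras_top--;  /* speculative pop; a flush must restore this */
      return ras[ras_top % RAS_DEPTH];
  }

The point of the structure is that on_call() runs when a call is
fetched and predict_return() when the matching ret is fetched, many
cycles before the ret executes, so the front end can keep fetching
from the predicted return address.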
CPU vendors are usually close-mouthed about what exactly they're doing
(it's some of their secret sauce), but some are more open than others.
One not-too-old CPU design that has a relatively thorough description of
the branch prediction architecture (though not the nitty-gritty details)
is the AMD Jaguar (low-power x86):
http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip
Section 2.7.1 in the PDF has a (very) brief description of the various
branch prediction mechanisms.
-Fabian
On 8/14/2018 2:09 PM, 2016 quekong via llvm-dev wrote:
> ( I don't know if it's allowed to ask such question, if not, please remind me. )
>
> I know Intel has implemented several static branch prediction
> mechanisms over the years:
> * 80486 era: always not taken
> * Pentium 4 era: backwards taken / forwards not taken
> * Pentium M, Core 2: no static prediction; the outcome depends on
> whatever happens to be in the corresponding BTB entry, according to
> Agner's optimization guide ¹.
> * Newer CPUs like Ivy Bridge and Haswell have become increasingly
> opaque, according to Matt G's experiment ².
>
> And Intel seems not to want to talk about it any more; the latest
> material I could find in Intel's documentation was written about ten
> years ago.
>
> I know static branch prediction is (far?) less important than dynamic
> prediction, but in quite a few situations the CPU will be completely
> lost, and the programmer (with the compiler) is usually the best
> guide. Of course these situations are usually not performance
> bottlenecks, because once a branch is executed frequently, the
> dynamic predictor will capture it.
>
> Since Intel no longer clearly documents the prediction mechanism,
> GCC's __builtin_expect() can do nothing more than remove the unlikely
> branch from the hot path (or the reverse for a likely branch).
>
> I am not familiar with CPU design and I don't know exactly what
> mechanism Intel uses for its static predictor nowadays. I just feel
> the best static mechanism for Intel would be to clearly document, for
> its CPUs, "where I plan to go when the dynamic predictor fails:
> forward or backward", because at that point the programmer is usually
> the best guide.
>
>
> APPENDIX:
> ¹ Agner's optimization guide, section 3.5:
> https://www.agner.org/optimize/microarchitecture.pdf
>
> ² Matt G's experiment: https://xania.org/201602/bpu-part-two