<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<br>
<br>
<div class="moz-cite-prefix">On 01/21/2016 01:51 PM, Sean Silva
wrote:<br>
</div>
<blockquote
cite="mid:CAHnXoamJM+rdMmr_+SvP0pOrVo=e_yKoCKEN+5FyrQsyPD1V9Q@mail.gmail.com"
type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Jan 21, 2016 at 1:33 PM,
Philip Reames <span dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"><span class=""> <br>
<br>
<div>On 01/19/2016 09:04 PM, Sean Silva via llvm-dev
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_extra">AFAIK, the cost of a
well-predicted, not-taken branch is the same as
a nop on every x86 made in the last many years.
See <a moz-do-not-send="true"
href="http://www.agner.org/optimize/instruction_tables.pdf"
target="_blank">http://www.agner.org/optimize/instruction_tables.pdf</a>
<div class="gmail_quote">
<div>Generally speaking a correctly-predicted
not-taken branch is basically identical to a
nop, and a correctly-predicted taken branch
has an extra overhead similar to an "add"
or other extremely cheap operation. </div>
</div>
</div>
</div>
</blockquote>
</span> Specifically on this point only: While
absolutely true for most micro-benchmarks, this is less
true at large scale. I've definitely seen removing a
highly predictable branch (in many, many places, some of
which are hot) to benefit performance in the 5-10%
range. For instance, removing highly predictable
branches is the primary motivation of implicit null
checking. (<a moz-do-not-send="true"
href="http://llvm.org/docs/FaultMaps.html"
target="_blank">http://llvm.org/docs/FaultMaps.html</a>).
Where exactly the performance improvement comes from is
hard to say, but, empirically, it does matter. <br>
<br>
(Caveat to above: I have not run an experiment that
actually put in the same number of bytes in nops. It's
possible the entire benefit I mentioned is code size
related, but I doubt it given how many ticks a sample
profiler will show on said branches.)<br>
</div>
</blockquote>
<div><br>
</div>
<div>Interesting. Another possible explanation is that these
extra branches cause contention on branch-prediction
resources. </div>
</div>
</div>
</div>
</blockquote>
I've heard and proposed this explanation in the past as well, but
I've never heard of anyone who could answer the question
definitively. <br>
<br>
The other explanation I've considered is that the processor has a
finite speculation depth (i.e. a limit on the number of in-flight
predicted branches), and the extra branches keep the processor from
speculating on "interesting" branches because its slots are full of
uninteresting ones. However, my hardware friends tell me this is a
somewhat questionable explanation, since the check branches should be
easy to resolve and retire quickly. <br>
<br>
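For concreteness, the kind of "check branch" in question looks roughly
like the explicit null check in the C sketch below. The "node" type and
the "load_value_checked" function are hypothetical names used only for
illustration, and the fallback return merely stands in for whatever the
runtime's real failure path would be. <br>
<pre>
```c
/* Hypothetical type, used only for illustration. */
struct node { int value; };

/* Explicit null check: this compiles to a compare-and-branch that is
   almost never taken in practice, i.e. a highly predictable branch. */
static int load_value_checked(const struct node *n, int fallback) {
    if (!n)               /* the check branch under discussion */
        return fallback;  /* cold path: stands in for the failure path */
    return n->value;      /* hot path */
}
```
</pre>
In the common case the pointer is non-null, so the branch is not taken
and predicts well. Implicit null checking aims to take that branch off
the hot path entirely. <br>
<br>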
<blockquote
cite="mid:CAHnXoamJM+rdMmr_+SvP0pOrVo=e_yKoCKEN+5FyrQsyPD1V9Q@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>In the past when talking with Dan about WebAssembly
sandboxing, IIRC he said that they found about 15%
overhead, due primarily to branch-prediction resource
contention. </div>
</div>
</div>
</div>
</blockquote>
15% seems a bit high to me, but I don't have anything concrete to
share here unfortunately. <br>
<blockquote
cite="mid:CAHnXoamJM+rdMmr_+SvP0pOrVo=e_yKoCKEN+5FyrQsyPD1V9Q@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>In fact I think they had a pretty clear idea of wanting
a new instruction which is just a "statically predict
never taken and don't use any branch-prediction resources"
branch (this is on x86 IIRC; some arches actually
obviously have such an instruction!).</div>
</div>
</div>
</div>
</blockquote>
This has been on my wish list for a while. It would make many
things so much easier.<br>
<br>
The sickly amusing bit is that x86 has two different forms of this,
neither of which actually works:<br>
1) There are prefixes for branches which are supposed to control the
prediction direction. My understanding is that code which tried
using them was wrong so often that modern chips interpret them as
nop padding. We actually use this to produce near-arbitrary-length
nops. :)<br>
2) x86 (but not x86-64) has an "into" instruction which triggers an
interrupt if the overflow flag is set. (Hey, signal handlers are
just weird branches, right? :p) However, this does not work as
designed in x86-64. My understanding is that the original AMD
implementation had a bug in this instruction and the bug essentially
got written into the spec for all future chips. :(<br>
<br>
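Without a usable "into", an overflow check has to be spelled as an
ordinary branch. The following is a minimal C sketch of that pattern,
assuming a GCC or Clang compiler (__builtin_add_overflow is a compiler
builtin, not standard C). The "checked_add" name and the "overflowed"
out-parameter are hypothetical, and the cold path only stands in for
whatever a trap handler would do. <br>
<pre>
```c
/* Checked addition with an explicit, rarely taken overflow branch.
   On x86 this usually lowers to an add followed by a conditional
   jump on the overflow flag (an assumption about typical codegen). */
static int checked_add(int a, int b, int *overflowed) {
    int result;
    if (__builtin_add_overflow(a, b, &result)) {
        *overflowed = 1;  /* cold path: stands in for the trap handler */
        return 0;
    }
    *overflowed = 0;
    return result;        /* hot path */
}
```
</pre>
Every checked operation pays for one such branch, which is exactly the
sort of highly predictable but not free branch this thread is about. <br>
<br>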
Philip<br>
</body>
</html>