On Thu, Jan 21, 2016 at 1:33 PM, Philip Reames <listmail@philipreames.com> wrote:

> On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote:
>> AFAIK, the cost of a well-predicted, not-taken branch is the same as a
>> nop on every x86 made in the last many years. See
>> http://www.agner.org/optimize/instruction_tables.pdf
>>
>> Generally speaking, a correctly-predicted not-taken branch is basically
>> identical to a nop, and a correctly-predicted taken branch has extra
>> overhead similar to an "add" or other extremely cheap operation.
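
(To make the micro-benchmark sense of that claim concrete, the kind of toy
measurement I have in mind is below. This is a sketch, not a rigorous
benchmark; the names and numbers are made up. The volatile `sink` is only
there to keep the compiler from deleting the branch, and both loops do the
same volatile load, so the difference between the two timings isolates the
compare-and-branch itself. Compile with e.g. "cc -O2" on a POSIX system.)

    #include <stdio.h>
    #include <time.h>

    volatile long sink = 0;   /* volatile: keeps the load and branch alive at -O2 */

    static double secs(struct timespec a, struct timespec b) {
        return (double)(b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        const long N = 1000000000L;
        struct timespec t0, t1, t2;
        long acc = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++) {
            if (sink != 0)        /* always false: perfectly predicted, never taken */
                acc += 100;
            acc += i;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        for (long i = 0; i < N; i++) {
            acc += sink + i;      /* same volatile load, branch removed */
        }
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("with branch:    %.3fs (acc=%ld)\n", secs(t0, t1), acc);
        printf("without branch: %.3fs\n", secs(t1, t2));
        return 0;
    }

On any recent x86 I'd expect the two timings to come out essentially
identical, which is exactly the micro-benchmark sense in which the claim
holds.
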
> Specifically on this point only: while absolutely true for most
> micro-benchmarks, this is less true at large scale. I've definitely seen
> removing a highly predictable branch (in many, many places, some of
> which are hot) benefit performance in the 5-10% range. For instance,
> removing highly predictable branches is the primary motivation of
> implicit null checking (http://llvm.org/docs/FaultMaps.html). Where
> exactly the performance improvement comes from is hard to say, but,
> empirically, it does matter.
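
(For anyone following along, the transformation in question looks roughly
like the following. This is my sketch of the idea, with made-up names
like "Obj" and "throw_null_pointer_exception"; the exact mechanics are in
the FaultMaps doc linked above.)

    struct Obj { int field; };

    extern void throw_null_pointer_exception(void);

    /* Explicit form: a compare-and-branch guards every access. The
       branch is almost never taken, but it executes on every call. */
    int get_field_explicit(struct Obj *o) {
        if (o == 0)
            throw_null_pointer_exception();
        return o->field;
    }

    /* Implicit form: the compare-and-branch is deleted and only the
       load is emitted (a single "mov (%rdi), %eax" on x86-64). If o is
       ever null, the load faults; the runtime's signal handler looks
       the faulting PC up in the fault map and dispatches to the
       exception path instead of crashing. */
    int get_field_implicit(struct Obj *o) {
        return o->field;
    }
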
> (Caveat to the above: I have not run an experiment that actually put in
> the same number of bytes of nops. It's possible the entire benefit I
> mentioned is code size related, but I doubt it given how many ticks a
> sample profiler will show on said branches.)

Interesting. Another possible explanation is that these extra branches
cause contention on branch-prediction resources. In the past, when
talking with Dan about WebAssembly sandboxing, IIRC he said that they
found about 15% overhead, due primarily to branch-prediction resource
contention. In fact, I think they had a pretty clear idea of wanting a
new instruction which is just a "statically predict never taken and
don't use any branch-prediction resources" branch (this is on x86, IIRC;
some architectures actually do have such an instruction!).
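
For reference, the closest software-level tool we have today is a static
hint like GCC/Clang's __builtin_expect, which steers block layout so the
common case falls through, but (as far as I know) does nothing to stop
the hardware from spending predictor/BTB state on the branch at run
time -- which is exactly the cost such an instruction would avoid:

    /* Conventional wrapper around the GCC/Clang builtin. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int checked_read(int *p) {
        if (unlikely(p == 0))   /* hint: common case should fall through */
            return -1;          /* cold path, typically placed out of line */
        return *p;
    }

So the hint helps the compiler's layout decisions, but it wouldn't have
helped the sandboxing case above.
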
> p.s. Sean mentions down-thread that most of the slowdown from checks is
> in the effect on the optimizer, not the direct impact of the
> instructions emitted. This is absolutely our experience as well. I
> don't intend for anything I said above to imply otherwise.
>
> Philip
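
(A contrived sketch of that optimizer effect, reusing the made-up names
from the sketch above: each check is nearly free to execute, but it
introduces a side exit the compiler must preserve, so the loads can no
longer be freely reordered or combined across iterations. This is my
illustration, not from any real workload.)

    int sum_fields(struct Obj **objs, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            /* The branch costs almost nothing to execute, but the side
               exit it creates pins the surrounding memory operations:
               the compiler can't hoist, widen, or vectorize the loads
               past a point where the function might throw. */
            if (objs[i] == 0)
                throw_null_pointer_exception();
            sum += objs[i]->field;
        }
        return sum;
    }
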
-- Sean Silva