<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <br>

    <br>

    <div class="moz-cite-prefix">On 01/21/2016 01:51 PM, Sean Silva

      wrote:<br>

    </div>

    <blockquote

cite="mid:CAHnXoamJM+rdMmr_+SvP0pOrVo=e_yKoCKEN+5FyrQsyPD1V9Q@mail.gmail.com"

      type="cite">

      <div dir="ltr"><br>

        <div class="gmail_extra"><br>

          <div class="gmail_quote">On Thu, Jan 21, 2016 at 1:33 PM,

            Philip Reames <span dir="ltr">&lt;<a moz-do-not-send="true"
                href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>&gt;</span>

            wrote:<br>

            <blockquote class="gmail_quote" style="margin:0 0 0

              .8ex;border-left:1px #ccc solid;padding-left:1ex">

              <div text="#000000" bgcolor="#FFFFFF"><span class=""> <br>

                  <br>

                  <div>On 01/19/2016 09:04 PM, Sean Silva via llvm-dev

                    wrote:<br>

                  </div>

                  <blockquote type="cite">

                    <div dir="ltr"><br>

                      <div class="gmail_extra">AFAIK, the cost of a

                        well-predicted, not-taken branch is the same as

                        a nop on every x86 made in the last many years.

                        See <a moz-do-not-send="true"

                          href="http://www.agner.org/optimize/instruction_tables.pdf"

                          target="_blank">http://www.agner.org/optimize/instruction_tables.pdf</a>

                        <div class="gmail_quote">

                          <div>Generally speaking a correctly-predicted

                            not-taken branch is basically identical to a

                            nop, and a correctly-predicted taken branch

                            has an extra overhead similar to an "add"

                            or other extremely cheap operation. </div>

                        </div>

                      </div>

                    </div>

                  </blockquote>

                </span> Specifically on this point only: While

                absolutely true for most micro-benchmarks, this is less

                true at large scale.  I've definitely seen removing a

                highly predictable branch (in many, many places, some of

                which are hot) to benefit performance in the 5-10%

                range.  For instance, removing highly predictable

                branches is the primary motivation of implicit null

                checking.  (<a moz-do-not-send="true"

                  href="http://llvm.org/docs/FaultMaps.html"

                  target="_blank">http://llvm.org/docs/FaultMaps.html</a>). 

                Where exactly the performance improvement comes from is

                hard to say, but, empirically, it does matter.  <br>

                <br>

                (Caveat to above: I have not run an experiment that

                actually put in the same number of bytes in nops.  It's

                possible the entire benefit I mentioned is code size

                related, but I doubt it given how many ticks a sample

                profiler will show on said branches.)<br>

              </div>

            </blockquote>

            <div><br>

            </div>

            <div>Interesting. Another possible explanation is that these

              extra branches cause contention on branch-prediction

              resources. </div>

          </div>

        </div>

      </div>

    </blockquote>

    I've heard and proposed this explanation in the past as well, but
    I've never heard of anyone who could answer the question
    definitively.  <br>

    <br>
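    As a purely illustrative aside, here is a toy Python model of how
    that kind of contention could arise.  It assumes a table of 2-bit
    saturating counters indexed by the low bits of the branch address,
    with no tags and no history; real predictors are far more
    sophisticated.  The addresses, the table size, and the names
    <code>simulate</code> and <code>loop_trace</code> are all made up
    for this sketch.<br>
<pre>
```python
def simulate(table_size, trace):
    """Run (address, taken) pairs through a table of 2-bit saturating counters.

    Returns a dict mapping each branch address to its misprediction count.
    """
    counters = [0] * table_size      # 0 and 1 predict not-taken; 2 and 3 predict taken
    misses = {}
    for address, taken in trace:
        slot = address % table_size  # assumed indexing: low bits of the address
        predicted_taken = counters[slot] in (2, 3)
        misses.setdefault(address, 0)
        if predicted_taken != taken:
            misses[address] += 1
        # Saturating update toward the actual outcome.
        step = 1 if taken else -1
        counters[slot] = min(3, max(0, counters[slot] + step))
    return misses


def loop_trace(iterations, check_address=None):
    """Build a trace for a loop whose back-edge (address 0x40) is always taken,
    optionally preceded on each iteration by a never-taken check branch."""
    trace = []
    for _ in range(iterations):
        if check_address is not None:
            trace.append((check_address, False))
        trace.append((0x40, True))
    return trace


TABLE_SIZE = 16
ITERATIONS = 1000

# No check branch at all.
baseline = simulate(TABLE_SIZE, loop_trace(ITERATIONS))
# Check branch that maps to a different table slot than the back-edge.
separate = simulate(TABLE_SIZE, loop_trace(ITERATIONS, check_address=0x41))
# Check branch that maps to the same table slot as the back-edge.
aliased = simulate(TABLE_SIZE, loop_trace(ITERATIONS, check_address=0x50))

print("baseline back-edge misses:", baseline[0x40])
print("separate-slot back-edge misses:", separate[0x40])
print("aliased-slot back-edge misses:", aliased[0x40])
```
</pre>
    In this toy model the back-edge only mispredicts while its counter
    warms up when it has a slot to itself, but it mispredicts on every
    iteration once a never-taken check shares its slot, even though the
    check itself is always predicted correctly.  It says nothing about
    how often such aliasing actually happens on real hardware.<br>
    <br>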

    The other explanation I've considered is that the processor has a
    finite speculation depth (i.e. a limit on how many predicted
    branches can be in flight), and the extra branches keep the
    processor from speculating on "interesting" branches because that
    window is full of uninteresting ones.  However, my hardware friends
    tell me this is a somewhat questionable explanation, since the check
    branches should be easy to resolve and retire quickly.  <br>

    <br>

    <blockquote

cite="mid:CAHnXoamJM+rdMmr_+SvP0pOrVo=e_yKoCKEN+5FyrQsyPD1V9Q@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div class="gmail_extra">

          <div class="gmail_quote">

            <div>In the past when talking with Dan about WebAssembly

              sandboxing, IIRC he said that they found about 15%

              overhead, due primarily to branch-prediction resource

              contention. </div>

          </div>

        </div>

      </div>

    </blockquote>

    15% seems a bit high to me, but unfortunately I don't have anything
    concrete to share here.  <br>

    <blockquote

cite="mid:CAHnXoamJM+rdMmr_+SvP0pOrVo=e_yKoCKEN+5FyrQsyPD1V9Q@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div class="gmail_extra">

          <div class="gmail_quote">

            <div>In fact I think they had a pretty clear idea of wanting

              a new instruction which is just a "statically predict

              never taken and don't use any branch-prediction resources"

              branch (this is on x86 IIRC; some arches actually

              obviously have such an instruction!).</div>

          </div>

        </div>

      </div>

    </blockquote>

    This has been on my wish list for a while.  It would make many

    things so much easier.<br>

    <br>

    The sickly amusing bit is that x86 has two different forms of this,
    neither of which actually works:<br>
    1) There are prefixes for branches which are supposed to control the
    prediction direction.  My understanding is that code which tried
    using them was so often wrong that modern chips interpret them as
    nop padding.  We actually use this to produce nearly
    arbitrary-length nops.  :)<br>

    2) x86 (but not x86-64) had an "into" instruction which triggered an
    interrupt if the overflow flag was set.  (Hey, signal handlers are
    just weird branches, right? :p)  However, this does not work as
    designed in x86-64.  My understanding is that the original AMD
    implementation had a bug in this instruction and the bug essentially
    got written into the spec for all future chips.  :(<br>

    <br>

    Philip<br>

  </body>

</html>