<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <br>

    <br>

    <div class="moz-cite-prefix">On 01/12/2016 11:16 PM, JF Bastien

      wrote:<br>

    </div>

    <blockquote

cite="mid:CABdywOcm1o4m6mb=d962WoiX_389hbMV5o65sXmvo4jn3wan6w@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>Hello, fencing enthusiasts!</div>

        <div><br>

        </div>

        <b>TL;DR:</b> We'd like to propose an addition to the LLVM

        memory model requiring non-temporal accesses be surrounded by

        non-temporal load barriers and non-temporal store barriers, and

        we'd like to add such orderings to the <font face="monospace,

          monospace">fence</font> IR opcode.

        <div><br>

        </div>

        <div>We are open to different approaches, hence this email

          instead of a patch.<br>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>Who's "we"?</b></div>

          <div><br>

          </div>

          <div>Philip Reames brought this to my attention, and we've had

            numerous discussions with Hans Boehm on the topic. Any

            mistakes below are my own, all the clever bits are theirs.</div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>Why?</b></div>

          <div><br>

          </div>

          <div>Ignore non-temporals for a moment: on most x86 targets

            LLVM generates an <font face="monospace, monospace">mfence</font>

            for <font face="monospace, monospace">seq_cst</font> atomic

            fencing. One could instead use a locked idempotent atomic

            access to top-of-stack such as <font face="monospace,

              monospace">lock or4i [RSP-8] 0</font>. Philip has measured

            this as equivalent on micro-benchmarks, but as ~25% faster

            in macro-benchmarks (other codebases confirm this). There's

            one problem with this approach: non-temporal accesses on x86

            are only ordered by fence instructions! This means that code

            using non-temporal accesses can't rely on LLVM's <font

              face="monospace, monospace">fence</font> opcode to do the

            right thing; it instead has to rely on

            architecture-specific <font face="monospace, monospace">_mm*fence</font>

            intrinsics.</div>
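          <div><br>
          </div>
          <div>As a rough illustration of where such a fence sits in
            source code, here is a minimal C++ sketch of a Dekker-style
            handshake. The names flag_a, flag_b, try_enter_a and
            try_enter_b are hypothetical and do not come from this thread.
            Per the paragraph above, the seq_cst fence here is what LLVM
            lowers to mfence on most x86 targets, and it should not be
            assumed to order non-temporal accesses.</div>

```cpp
#include <atomic>

// Hypothetical flags for illustration only; not part of the proposal.
std::atomic<int> flag_a{0};
std::atomic<int> flag_b{0};

// Side A: publish our flag, fence, then check the other side's flag.
// The seq_cst fence keeps the relaxed store from being reordered
// after the relaxed load (the classic store-buffering pattern).
inline bool try_enter_a() {
  flag_a.store(1, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return flag_b.load(std::memory_order_relaxed) == 0;
}

// Side B: the mirror image of side A.
inline bool try_enter_b() {
  flag_b.store(1, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return flag_a.load(std::memory_order_relaxed) == 0;
}
```

          <div>With both fences in place the two calls can never both
            return true when they race, whichever instruction the backend
            picks to implement the fence.</div>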

        </div>

      </div>

    </blockquote>

    Just to clarify: the proposal to change the implementation of

    seq_cst is arguably separate from this proposal.  It will go through

    normal patch review once the semantics are addressed.  Whatever we

    end up doing with seq_cst, we currently have a semantic hole in our

    specification around non-temporals that needs to be addressed.  <br>

    <br>

    Another approach would be to define the current fences as fencing

    non-temporals and introduce new ones that don't.  Either approach

    is workable.  I believe that new fences for non-temporals are the

    appropriate choice, given that this would more closely match existing

    practice.  <br>

    <br>

    We could also consider forward-serializing bitcode to the stronger

    form, whichever choice we make.  That would be the conservatively

    correct thing to do for older bitcode, which might be assuming stronger

    semantics than our barriers explicitly provided.<br>

    <blockquote

cite="mid:CABdywOcm1o4m6mb=d962WoiX_389hbMV5o65sXmvo4jn3wan6w@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>But wait! Who said developers need to issue any type

              of fence when using non-temporals?</b></div>

          <div><br>

          </div>

          <div>Well, the LLVM memory model sure didn't. The x86 memory

            model does (volume 3 section 8.2.2 Memory Ordering) but LLVM

            targets more than x86 and the backends are free to ignore

            the <font face="monospace, monospace">!nontemporal</font>

            metadata, and AFAICT the x86 backend doesn't add those

            fences.</div>

          <div><br>

          </div>

          <div>Therefore even without the above optimization the LLVM

            language reference is incorrect: non-temporals should be

            bracketed by barriers. This applies even without threading!

            Non-temporal accesses aren't guaranteed to interact well

            with regular accesses, which means that regular loads cannot

            move "down" a non-temporal barrier, and regular stores

            cannot move "up" a non-temporal barrier.</div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>Why not just have the compiler add the fences?</b></div>

          <div><br>

          </div>

          <div>LLVM could do this, either as a per-backend thing or a

            hookable pass such as <font face="monospace, monospace">AtomicExpandPass</font>.

            It seems more natural to ask the programmer to express

            intent, just as is done with atomics. In fact, a backend is

            currently free to ignore <span

              style="font-family:monospace,monospace">!nontemporal</span> on

            load and store and could therefore generate only half of

            what's requested, leading to incorrect code. That would of

            course be silly: backends should either honor all <span

              style="font-family:monospace,monospace">!nontemporal</span> or

            none of them, but who knows what the middle-end does.</div>

          <div><br>

          </div>

          <div>Put another way: some optimized C libraries use

            non-temporal accesses (when string instructions aren't du

            jour) and they terminate their copying with an <font

              face="monospace, monospace">sfence</font>. It's a de-facto

            convention, the ABI doesn't say anything, but let's avoid

            divergence.</div>
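          <div><br>
          </div>
          <div>For concreteness, the convention described above looks
            roughly like the following C++ sketch. The helper name
            stream_copy and its alignment and size preconditions are
            assumptions made for illustration; real C libraries are more
            elaborate. On targets without SSE2 the sketch simply falls
            back to memcpy.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#endif

// Hypothetical helper: copies `bytes` bytes (a multiple of 16) into a
// 16-byte-aligned destination using non-temporal stores where available.
inline void stream_copy(void* dst, const void* src, std::size_t bytes) {
  assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
  assert(bytes % 16 == 0);
#if defined(__SSE2__)
  auto* d = static_cast<__m128i*>(dst);
  const auto* s = static_cast<const __m128i*>(src);
  for (std::size_t i = 0; i < bytes / 16; ++i) {
    // Non-temporal store: weakly ordered with respect to other stores.
    _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
  }
  // Terminate the copy with an sfence so the streaming stores are
  // ordered before any store that follows.
  _mm_sfence();
#else
  // Assumption: no non-temporal stores available, use a plain copy.
  std::memcpy(dst, src, bytes);
#endif
}
```

          <div>The trailing sfence is written by hand here through an
            architecture-specific intrinsic, which is the situation the
            proposed target-independent barriers are meant to replace.</div>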

          <div><br>

          </div>

          <div>Aside: one day we may live in <a moz-do-not-send="true"

href="http://lists.llvm.org/pipermail/llvm-dev/2014-September/076701.html">the

              fence elimination promised land</a> where fences are

            exactly where they need to be, no more, no less.<br>

          </div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>Isn't x86's <font face="monospace, monospace">lfence</font>

              just a no-op?</b></div>

          <div><br>

          </div>

          <div>Yes, but we're proposing the addition of a

            target-independent non-temporal load barrier. It'll be up to

            the x86 backend to make it an <font face="monospace,

              monospace">X86ISD::MEMBARRIER</font> and other backends to

            get it right (hint: it's not always a no-op).</div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>Won't this optimization cause coherency misses? C++

              accesses the thread stack concurrently all the time!</b></div>

          <div>

            <div><br>

            </div>

            <div>Maybe, but then it isn't much of an optimization if

              it's slowing code down. LLVM doesn't just target C++, and

              it's really up to the backend to decide whether one fence

              type is better than another (on x86, whether a locked

              top-of-stack idempotent operation is better than <font

                face="monospace, monospace">mfence</font>). Other

              languages have private stacks where this isn't an issue,

              and where the stack top can reasonably be assumed to be in

              cache.</div>

          </div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>How will this affect non-user-mode code (i.e. kernel

              code)?</b></div>

          <div><br>

          </div>

          <div>Kernel code still has to ask for _mm_<font

              face="monospace, monospace">mfence</font> if it wants <font

              face="monospace, monospace">mfence</font>: C11 and C++11

            barriers aren't specified as a specific instruction.</div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>Is it safe to access top-of-stack?</b></div>

          <div><br>

          </div>

          <div>AFAIK yes, and the ABI-specified red zone has our back

            (or front if the stack grows up ☻).</div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>What about non-x86 architectures?</b></div>

          <div><br>

          </div>

          <div>Architectures such as ARMv8 support non-temporal

            instructions and require barriers such as <font

              face="monospace, monospace">DMB nshld</font> to order

            loads and <font face="monospace, monospace">DMB nshst</font>

            to order stores.</div>

          <div><br>

          </div>

        </div>

        Even ARM's address-dependency rule (a.k.a. the ill-fated <font

          face="monospace, monospace">std::memory_order_consume</font>)

        fails to hold with non-temporals:<br>

        <blockquote style="margin:0px 0px 0px

          40px;border:none;padding:0px">

          <div>

            <div>

              <div><font face="monospace, monospace">LDR X0, [X3]</font></div>

            </div>

          </div>

          <div>

            <div>

              <div><font face="monospace, monospace">LDNP X2, X1, [X0]

                  // X0 may not be loaded when the instruction executes!</font></div>

            </div>

          </div>

        </blockquote>

        <div>

          <div><br>

          </div>

          <div><br>

          </div>

          <div><b>Who uses non-temporals anyways?</b></div>

          <div><br>

          </div>

          <div>That's an awfully personal question!</div>

        </div>

      </div>

    </blockquote>

    <br>

  </body>

</html>