<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi JF, Philip,<br>

<br>

Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load. How will the usage model for those change?<br></blockquote><div><br></div><div>I think you would use them in the same way, but you'd also have to use __builtin_nontemporal_store_fence and __builtin_nontemporal_load_fence. </div><div><br></div><div>Unless we have LLVM automagically figure out where non-temporal fences should go, which I think isn't as good an approach.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

Thanks again,<br>

Hal<br>

<br>

----- Original Message -----<br>

<br>

> From: "Philip Reames via llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

> To: "JF Bastien" <<a href="mailto:jfb@google.com">jfb@google.com</a>>, "llvm-dev"<br>

> <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

> Cc: "Hans Boehm" <<a href="mailto:hboehm@google.com">hboehm@google.com</a>><br>

> Sent: Wednesday, January 13, 2016 11:45:35 AM<br>

> Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR<br>

<span class=""><br>

> On 01/12/2016 11:16 PM, JF Bastien wrote:<br>

<br>

> > Hello, fencing enthusiasts!<br>

><br>

<br>

> > TL;DR: We'd like to propose an addition to the LLVM memory model<br>

> > requiring non-temporal accesses be surrounded by non-temporal load<br>

> > barriers and non-temporal store barriers, and we'd like to add such<br>

> > orderings to the fence IR opcode.<br>

><br>

<br>

> > We are open to different approaches, hence this email instead of a<br>

> > patch.<br>

><br>

<br>

> > Who's "we"?<br>

><br>

<br>

> > Philip Reames brought this to my attention, and we've had numerous<br>

> > discussions with Hans Boehm on the topic. Any mistakes below are my<br>

> > own, all the clever bits are theirs.<br>

><br>

<br>

> > Why?<br>

><br>

<br>

> > Ignore non-temporals for a moment: on most x86 targets LLVM<br>

> > generates<br>

> > an mfence for seq_cst atomic fencing. One could instead use a<br>

> > locked<br>

> > idempotent atomic access to top-of-stack such as lock or4i<br>

> > [RSP-8]<br>

</span>> > 0. Philip has measured this as equivalent on micro-benchmarks, but<br>

<div><div class="h5">> > as ~25% faster in macro-benchmarks (other codebases confirm this).<br>

> > There's one problem with this approach: non-temporal accesses on<br>

> > x86<br>

> > are only ordered by fence instructions! This means that code using<br>

> > non-temporal accesses can't rely on LLVM's fence opcode to do the<br>

> > right thing; it instead has to rely on architecture-specific<br>

> > _mm*fence intrinsics.<br>

><br>

> Just to clarify: the proposal to change the implementation of<br>

> seq_cst is arguably separate from this proposal. It will go through<br>

> normal patch review once the semantics are addressed. Whatever we<br>

> end up doing with seq_cst, we currently have a semantic hole in our<br>

> specification around non-temporals that needs to be addressed.<br>

<br>

> Another approach would be to define the current fences as fencing<br>

> non-temporals and introduce new ones that don't. Either approach<br>

> is workable. I believe that new fences for non-temporals are the<br>

> appropriate choice given that this would more closely match existing<br>

> practice.<br>

<br>

> We could also consider forward-serializing bitcode to the stronger form,<br>

> whichever choice we make. That would be the conservatively correct thing<br>

> to do for older bitcode, which might be assuming stronger semantics<br>

> than our barriers explicitly provided.<br>

<br>

> > But wait! Who said developers need to issue any type of fence when<br>

> > using non-temporals?<br>

><br>

<br>

> > Well, the LLVM memory model sure didn't. The x86 memory model does<br>

> > (volume 3 section 8.2.2 Memory Ordering) but LLVM targets more than<br>

> > x86 and the backends are free to ignore the !nontemporal metadata,<br>

> > and AFAICT the x86 backend doesn't add those fences.<br>

><br>

<br>

> > Therefore even without the above optimization the LLVM language<br>

> > reference is incorrect: non-temporals should be bracketed by<br>

> > barriers. This applies even without threading! Non-temporal<br>

> > accesses<br>

> > aren't guaranteed to interact well with regular accesses, which<br>

> > means that regular loads cannot move "down" a non-temporal barrier,<br>

> > and regular stores cannot move "up" a non-temporal barrier.<br>

><br>

<br>

> > Why not just have the compiler add the fences?<br>

><br>

<br>

> > LLVM could do this, either as a per-backend thing or a hookable<br>

> > pass<br>

</div></div>> > such as AtomicExpandPass. It seems more natural to ask the<br>

<span class="">> > programmer to express intent, just as is done with atomics. In<br>

> > fact,<br>

> > a backend is currently free to ignore !nontemporal on load and store<br>

> > and could therefore generate only half of what's requested, leading<br>

> > to incorrect code. That would of course be silly: backends should<br>

> > either honor all !nontemporal or none of them, but who knows what<br>

> > the<br>

> > middle-end does.<br>

><br>

<br>

> > Put another way: some optimized C libraries use non-temporal accesses<br>

> > (when string instructions aren't du jour) and they terminate their<br>

</span>> > copying with an sfence. It's a de-facto convention; the ABI<br>

<span class="">> > doesn't<br>

> > say anything, but let's avoid divergence.<br>

><br>

<br>

> > Aside: one day we may live in the fence elimination promised land<br>

> > where fences are exactly where they need to be, no more, no less.<br>

><br>

<br>

> > Isn't x86's lfence just a no-op?<br>

><br>

<br>

> > Yes, but we're proposing the addition of a target-independent<br>

> > non-temporal load barrier. It'll be up to the x86 backend to make<br>

> > it<br>

> > an X86ISD::MEMBARRIER and other backends to get it right (hint:<br>

> > it's<br>

> > not always a no-op).<br>

><br>

<br>

> > Won't this optimization cause coherency misses? C++ accesses the<br>

> > thread<br>

> > stack concurrently all the time!<br>

><br>

<br>

> > Maybe, but then it isn't much of an optimization if it's slowing<br>

> > code<br>

> > down. LLVM doesn't just target C++, and it's really up to the<br>

> > backend to decide whether one fence type is better than another (on<br>

> > x86, whether a locked top-of-stack idempotent operation is better<br>

</span>> > than mfence). Other languages have private stacks where this isn't<br>

<span class="">> > an issue, and where the stack top can reasonably be assumed to be<br>

> > in<br>

> > cache.<br>

><br>

<br>

> > How will this affect non-user-mode code (i.e. kernel code)?<br>

><br>

<br>

</span>> > Kernel code still has to ask for _mm_mfence if it wants mfence:<br>

<span class="">> > C11<br>

> > and C++11 barriers aren't specified as a specific instruction.<br>

><br>

<br>

> > Is it safe to access top-of-stack?<br>

><br>

<br>

> > AFAIK yes, and the ABI-specified red zone has our back (or front if<br>

> > the stack grows up ☻).<br>

><br>

<br>

> > What about non-x86 architectures?<br>

><br>

<br>

> > Architectures such as ARMv8 support non-temporal instructions and<br>

> > require barriers such as DMB nshld to order loads and DMB nshst to<br>

> > order stores.<br>

><br>

<br>

> > Even ARM's address-dependency rule (a.k.a. the ill-fated<br>

</span>> > std::memory_order_consume) fails to hold with non-temporals:<br>

<span class="">><br>

<br>

> > > LDR X0, [X3]<br>

> ><br>

><br>

<br>

> > > LDNP X2, X1, [X0] // X0 may not be loaded when the instruction<br>

> > > executes!<br>

> ><br>

><br>

> > Who uses non-temporals anyways?<br>

><br>

<br>

> > That's an awfully personal question!<br>

><br>

</span>

<span class=""><font color="#888888"><br>

--<br>

<br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</font></span></blockquote></div><br></div></div>
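<div>For readers unfamiliar with the de-facto convention mentioned in the quoted RFC (non-temporal stores terminated by an sfence), here is a minimal sketch in C. It assumes an x86 target with SSE2 and uses the existing _mm_stream_si32 and _mm_sfence intrinsics. The helper name stream_copy_i32 is hypothetical, and the fallback to memcpy on other targets is only there to keep the sketch portable. It does not use the __builtin_nontemporal_store_fence builtin discussed above, which is only a proposal at this point.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_stream_si32; also pulls in _mm_sfence */
#endif

/* Hypothetical helper (not from the thread): copy n 32-bit words using
 * non-temporal stores where available, then fence before returning. */
static void stream_copy_i32(int32_t *dst, const int32_t *src, size_t n) {
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; ++i) {
        /* Streaming store: hints that the data should bypass the cache. */
        _mm_stream_si32((int *)&dst[i], (int)src[i]);
    }
    /* Order the streaming stores before any later stores, such as a
     * "data ready" flag published to another thread. */
    _mm_sfence();
#else
    /* Assumption: without SSE2 this sketch falls back to a plain copy. */
    memcpy(dst, src, n * sizeof *dst);
#endif
}
```

<div>The point of the sketch is that the fence is written explicitly by the programmer at the end of the copy, which is the intent-expressing style the RFC argues for, rather than being inferred by the compiler.</div>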