<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Jan 14, 2016 at 1:05 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">----- Original Message -----<br>

> From: "JF Bastien" <<a href="mailto:jfb@google.com">jfb@google.com</a>><br>

> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>

> Cc: "Philip Reames" <<a href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>>, "Hans Boehm" <<a href="mailto:hboehm@google.com">hboehm@google.com</a>>, "llvm-dev"<br>

> <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

> Sent: Thursday, January 14, 2016 3:02:20 PM<br>

> Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR<br>

><br>

><br>

><br>

><br>

</span><span class="">> On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> ><br>

> wrote:<br>

><br>

><br>

> Hi JF, Philip,<br>

><br>

> Clang currently has __builtin_nontemporal_store and<br>

> __builtin_nontemporal_load. How will the usage model for those<br>

> change?<br>

><br>

><br>

><br>

> I think you would use them in the same way, but you'd have to also<br>

> use __builtin_nontemporal_store_fence and<br>

> __builtin_nontemporal_load_fence.<br>

<br>

</span>So we'll add new fence intrinsics. That makes sense.<br></blockquote><div><br></div><div>Correct, and I propose that this translate to an LLVM IR barrier, with a new type of memory ordering (non-temporal load, and non-temporal store). It can't be metadata, but it could be an attribute instead (akin to how load/store have atomic and volatile attributes).</div><div><br></div><div>We could then add the same concept to C++ but I won't tip my hand too much ;-)</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

> Unless we have LLVM automagically figure out where non-temporal<br>

> fences should go, which I think isn't as good of an approach.<br>

><br>

<br>

</span>I agree. Such a determination is likely to be too conservative in practice.<br></blockquote><div><br></div><div>Indeed, user control seems better here, especially since the user knows which memory aliases and therefore where the fence matters.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="HOEnZb"><font color="#888888">

 -Hal<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

><br>

> Thanks again,<br>

> Hal<br>

><br>

> ----- Original Message -----<br>

><br>

> > From: "Philip Reames via llvm-dev" < <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> ><br>

> > To: "JF Bastien" < <a href="mailto:jfb@google.com">jfb@google.com</a> >, "llvm-dev"<br>

> > < <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> ><br>

> > Cc: "Hans Boehm" < <a href="mailto:hboehm@google.com">hboehm@google.com</a> ><br>

> > Sent: Wednesday, January 13, 2016 11:45:35 AM<br>

> > Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR<br>

><br>

> > On 01/12/2016 11:16 PM, JF Bastien wrote:<br>

><br>

> > > Hello, fencing enthusiasts!<br>

> ><br>

><br>

> > > TL;DR: We'd like to propose an addition to the LLVM memory model<br>

> > > requiring non-temporal accesses be surrounded by non-temporal<br>

> > > load<br>

> > > barriers and non-temporal store barriers, and we'd like to add<br>

> > > such<br>

> > > orderings to the fence IR opcode.<br>

> ><br>

><br>

> > > We are open to different approaches, hence this email instead of<br>

> > > a<br>

> > > patch.<br>

> ><br>

><br>

> > > Who's "we"?<br>

> ><br>

><br>

> > > Philip Reames brought this to my attention, and we've had<br>

> > > numerous<br>

> > > discussions with Hans Boehm on the topic. Any mistakes below are<br>

> > > my<br>

> > > own, all the clever bits are theirs.<br>

> ><br>

><br>

> > > Why?<br>

> ><br>

><br>

> > > Ignore non-temporals for a moment: on most x86 targets LLVM<br>

> > > generates<br>

> > > an mfence for seq_cst atomic fencing. One could instead use a<br>

> > > locked<br>

> > > idempotent atomic access to top-of-stack such as lock or4i<br>

> > > [RSP-8]<br>

> > > 0. Philip has measured this as equivalent on micro-benchmarks,<br>

> > > but<br>


> > > as ~25% faster in macro-benchmarks (other codebases confirm<br>

> > > this).<br>

> > > There's one problem with this approach: non-temporal accesses on<br>

> > > x86<br>

> > > are only ordered by fence instructions! This means that code<br>

> > > using<br>

> > > non-temporal accesses can't rely on LLVM's fence opcode to do the<br>

> > > right thing; it instead has to rely on architecture-specific<br>

> > > _mm*fence intrinsics.<br>

> ><br>

> > Just to clarify: the proposal to change the implementation of<br>

> > seq_cst is arguably separate from this proposal. It will go through<br>

> > normal patch review once the semantics are addressed. Whatever we<br>

> > end up doing with seq_cst, we currently have a semantic hole in our<br>

> > specification around non-temporals that needs to be addressed.<br>

><br>

> > Another approach would be to define the current fences as fencing<br>

> > non-temporals and introducing new ones that don't. Either approach<br>

> > is workable. I believe that new fences for non-temporals are the<br>

> > appropriate choice given that would more closely match existing<br>

> > practice.<br>

><br>

> > We could also consider forward-serializing bitcode to the stronger<br>

> > form,<br>

> > whichever choice we make. That would be the conservatively correct<br>

> > thing<br>

> > to do for older bitcode, which might be assuming stronger semantics<br>

> > than our barriers explicitly provided.<br>

><br>

> > > But wait! Who said developers need to issue any type of fence<br>

> > > when<br>

> > > using non-temporals?<br>

> ><br>

><br>

> > > Well, the LLVM memory model sure didn't. The x86 memory model<br>

> > > does<br>

> > > (volume 3 section 8.2.2 Memory Ordering) but LLVM targets more<br>

> > > than<br>

> > > x86 and the backends are free to ignore the !nontemporal<br>

> > > metadata,<br>

> > > and AFAICT the x86 backend doesn't add those fences.<br>

> ><br>

><br>

> > > Therefore even without the above optimization the LLVM language<br>

> > > reference is incorrect: non-temporals should be bracketed by<br>

> > > barriers. This applies even without threading! Non-temporal<br>

> > > accesses<br>

> > > aren't guaranteed to interact well with regular accesses, which<br>

> > > means that regular loads cannot move "down" a non-temporal<br>

> > > barrier,<br>

> > > and regular stores cannot move "up" a non-temporal barrier.<br>

> ><br>

><br>

> > > Why not just have the compiler add the fences?<br>

> ><br>

><br>

> > > LLVM could do this, either as a per-backend thing or a hookable<br>

> > > pass<br>

> > > such as AtomicExpandPass. It seems more natural to ask the<br>

> > > programmer to express intent, just as is done with atomics. In<br>

> > > fact,<br>

> > > a backend is currently free to ignore !nontemporal on load and<br>

> > > store<br>

> > > and could therefore generate only half of what's requested,<br>

> > > leading<br>

> > > to incorrect code. That would of course be silly: backends should<br>

> > > either honor all !nontemporal or none of them, but who knows what<br>

> > > the<br>

> > > middle-end does.<br>

> ><br>

><br>

> > > Put another way: some optimized C libraries use non-temporal<br>

> > > accesses<br>

> > > (when string instructions aren't du jour) and they terminate<br>

> > > their<br>

> > > copying with an sfence. It's a de-facto convention; the ABI<br>

> > > doesn't<br>

> > > say anything, but let's avoid divergence.<br>

> ><br>

><br>

> > > Aside: one day we may live in the fence elimination promised land<br>

> > > where fences are exactly where they need to be, no more, no less.<br>

> ><br>

><br>

> > > Isn't x86's lfence just a no-op?<br>

> ><br>

><br>

> > > Yes, but we're proposing the addition of a target-independent<br>

> > > non-temporal load barrier. It'll be up to the x86 backend to make<br>

> > > it<br>

> > > an X86ISD::MEMBARRIER and other backends to get it right (hint:<br>

> > > it's<br>

> > > not always a no-op).<br>

> ><br>

><br>

> > > Won't this optimization cause coherency misses? C++ accesses the<br>

> > > thread<br>

> > > stack concurrently all the time!<br>

> ><br>

><br>

> > > Maybe, but then it isn't much of an optimization if it's slowing<br>

> > > code<br>

> > > down. LLVM doesn't just target C++, and it's really up to the<br>

> > > backend to decide whether one fence type is better than another<br>

> > > (on<br>

> > > x86, whether a locked top-of-stack idempotent operation is better<br>

> > > than mfence). Other languages have private stacks where this<br>

> > > isn't<br>

> > > an issue, and where the stack top can reasonably be assumed to be<br>

> > > in<br>

> > > cache.<br>

> ><br>

><br>

> > > How will this affect non-user-mode code (i.e. kernel code)?<br>

> ><br>

><br>

> > > Kernel code still has to ask for _mm_mfence if it wants mfence:<br>

> > > C11<br>

> > > and C++11 barriers aren't specified as a specific instruction.<br>

> ><br>

><br>

> > > Is it safe to access top-of-stack?<br>

> ><br>

><br>

> > > AFAIK yes, and the ABI-specified red zone has our back (or front<br>

> > > if<br>

> > > the stack grows up ☻).<br>

> ><br>

><br>

> > > What about non-x86 architectures?<br>

> ><br>

><br>

> > > Architectures such as ARMv8 support non-temporal instructions and<br>

> > > require barriers such as DMB nshld to order loads and DMB nshst<br>

> > > to<br>

> > > order stores.<br>

> ><br>

><br>

> > > Even ARM's address-dependency rule (a.k.a. the ill-fated<br>

> > > std::memory_order_consume) fails to hold with non-temporals:<br>

> ><br>

><br>

> > > > LDR X0, [X3]<br>

> > ><br>

> ><br>

><br>

> > > > LDNP X2, X1, [X0] // X0 may not be loaded when the instruction<br>

> > > > executes!<br>

> > ><br>

> ><br>

> > > Who uses non-temporals anyways?<br>

> ><br>

><br>

> > > That's an awfully personal question!<br>

> ><br>

> > _______________________________________________<br>

> > LLVM Developers mailing list<br>

> > <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

> > <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

><br>

> --<br>

><br>

> --<br>

> Hal Finkel<br>

> Assistant Computational Scientist<br>

> Leadership Computing Facility<br>

> Argonne National Laboratory<br>

><br>

><br>

<br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</div></div></blockquote></div><br></div></div>
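<div dir="ltr">The de-facto convention described in the quoted RFC (an optimized copy routine that uses non-temporal stores and terminates with an sfence) can be sketched roughly as below. This is only an illustration under stated assumptions: the helper name copy_words_nontemporal is hypothetical, the non-temporal path is taken only on x86-64 via the SSE2 intrinsics _mm_stream_si32 and _mm_sfence, and other targets fall back to ordinary stores. It does not show the fence orderings proposed in the thread, which did not exist in LLVM IR at the time of writing.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* _mm_stream_si32 and _mm_sfence; SSE2 is baseline on x86-64 */
#define HAVE_NT_STORE 1
#else
#define HAVE_NT_STORE 0
#endif

/* Hypothetical helper (not from the thread): copy n 32-bit words.
 * On x86-64 it uses non-temporal stores and then issues an sfence so the
 * streamed data is ordered before any later stores. Elsewhere it is a
 * plain copy loop. */
static void copy_words_nontemporal(int32_t *dst, const int32_t *src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
#if HAVE_NT_STORE
    for (size_t i = 0; i < n; ++i) {
        /* Non-temporal store: hints that dst[i] should bypass the cache. */
        _mm_stream_si32((int *)&dst[i], src[i]);
    }
    /* The trailing store fence that the convention relies on. */
    _mm_sfence();
#else
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
#endif
}
```

<div dir="ltr">A caller would use this like any copy routine, for example copy_words_nontemporal(dst, src, 8) on two int32_t arrays of length 8. The point of the sketch is that the fence here is an architecture-specific intrinsic, which is the situation the RFC wants to replace with a target-independent non-temporal fence.</div>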