<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Jan 14, 2016 at 1:05 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">----- Original Message -----<br>
> From: "JF Bastien" <<a href="mailto:jfb@google.com">jfb@google.com</a>><br>
> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>
> Cc: "Philip Reames" <<a href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>>, "Hans Boehm" <<a href="mailto:hboehm@google.com">hboehm@google.com</a>>, "llvm-dev"<br>
> <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>
> Sent: Thursday, January 14, 2016 3:02:20 PM<br>
> Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR<br>
><br>
><br>
><br>
><br>
</span><span class="">> On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> ><br>
> wrote:<br>
><br>
><br>
> Hi JF, Philip,<br>
><br>
> Clang currently has __builtin_nontemporal_store and<br>
> __builtin_nontemporal_load. How will the usage model for those<br>
> change?<br>
><br>
><br>
><br>
> I think you would use them in the same way, but you'd have to also<br>
> use __builtin_nontemporal_store_fence and<br>
> __builtin_nontemporal_load_fence.<br>
<br>
</span>So we'll add new fence intrinsics. That makes sense.<br></blockquote><div><br></div><div>Correct, and I propose that this translate to an LLVM IR barrier, with a new type of memory ordering (non-temporal load, and non-temporal store). It can't be metadata, but it could be an attribute instead (akin to how load/store have atomic and volatile attributes).</div><div><br></div><div>We could then add the same concept to C++ but I won't tip my hand too much ;-)</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
> Unless we have LLVM automagically figure out where non-temporal<br>
> fences should go, which I think isn't as good of an approach.<br>
><br>
<br>
</span>I agree. Such a determination is likely to be too conservative in practice.<br></blockquote><div><br></div><div>Indeed, user control seems better here, especially since the user knows which memory aliases and therefore where the fence matters.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="HOEnZb"><font color="#888888">
-Hal<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
><br>
> Thanks again,<br>
> Hal<br>
><br>
> ----- Original Message -----<br>
><br>
> > From: "Philip Reames via llvm-dev" < <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> ><br>
> > To: "JF Bastien" < <a href="mailto:jfb@google.com">jfb@google.com</a> >, "llvm-dev"<br>
> > < <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> ><br>
> > Cc: "Hans Boehm" < <a href="mailto:hboehm@google.com">hboehm@google.com</a> ><br>
> > Sent: Wednesday, January 13, 2016 11:45:35 AM<br>
> > Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR<br>
><br>
> > On 01/12/2016 11:16 PM, JF Bastien wrote:<br>
><br>
> > > Hello, fencing enthusiasts!<br>
> ><br>
><br>
> > > TL;DR: We'd like to propose an addition to the LLVM memory model<br>
> > > requiring non-temporal accesses be surrounded by non-temporal<br>
> > > load<br>
> > > barriers and non-temporal store barriers, and we'd like to add<br>
> > > such<br>
> > > orderings to the fence IR opcode.<br>
> ><br>
><br>
> > > We are open to different approaches, hence this email instead of<br>
> > > a<br>
> > > patch.<br>
> ><br>
><br>
> > > Who's "we"?<br>
> ><br>
><br>
> > > Philip Reames brought this to my attention, and we've had<br>
> > > numerous<br>
> > > discussions with Hans Boehm on the topic. Any mistakes below are<br>
> > > my<br>
> > > own, all the clever bits are theirs.<br>
> ><br>
><br>
> > > Why?<br>
> ><br>
><br>
> > > Ignore non-temporals for a moment: on most x86 targets LLVM<br>
> > > generates an mfence for seq_cst atomic fencing. One could instead<br>
> > > use a locked idempotent atomic access to top-of-stack such as<br>
> > > lock or4i [RSP-8] 0. Philip has measured this as equivalent on<br>
> > > micro-benchmarks, but as ~25% faster in macro-benchmarks (other<br>
> > > codebases confirm this).<br>
> > > There's one problem with this approach: non-temporal accesses on<br>
> > > x86<br>
> > > are only ordered by fence instructions! This means that code<br>
> > > using non-temporal accesses can't rely on LLVM's fence opcode to<br>
> > > do the right thing; it instead has to rely on<br>
> > > architecture-specific _mm*fence intrinsics.<br>
> ><br>
> > Just for clarity: the proposal to change the implementation of<br>
> > seq_cst is arguably separate from this proposal. It will go through<br>
> > normal patch review once the semantics are addressed. Whatever we<br>
> > end up doing with seq_cst, we currently have a semantic hole in our<br>
> > specification around non-temporals that needs to be addressed.<br>
><br>
> > Another approach would be to define the current fences as fencing<br>
> > non-temporals and introduce new ones that don't. Either approach<br>
> > is workable. I believe that new fences for non-temporals are the<br>
> > appropriate choice given that this would more closely match existing<br>
> > practice.<br>
><br>
> > We could also consider forward-serializing bitcode to the stronger<br>
> > form, whichever choice we make. That would be the conservatively<br>
> > correct thing to do for older bitcode, which might be assuming<br>
> > stronger semantics than our barriers explicitly provided.<br>
><br>
> > > But wait! Who said developers need to issue any type of fence<br>
> > > when<br>
> > > using non-temporals?<br>
> ><br>
><br>
> > > Well, the LLVM memory model sure didn't. The x86 memory model<br>
> > > does<br>
> > > (volume 3 section 8.2.2 Memory Ordering) but LLVM targets more<br>
> > > than<br>
> > > x86 and the backends are free to ignore the !nontemporal<br>
> > > metadata,<br>
> > > and AFAICT the x86 backend doesn't add those fences.<br>
> ><br>
><br>
> > > Therefore even without the above optimization the LLVM language<br>
> > > reference is incorrect: non-temporals should be bracketed by<br>
> > > barriers. This applies even without threading! Non-temporal<br>
> > > accesses<br>
> > > aren't guaranteed to interact well with regular accesses, which<br>
> > > means that regular loads cannot move "down" a non-temporal<br>
> > > barrier,<br>
> > > and regular stores cannot move "up" a non-temporal barrier.<br>
> ><br>
><br>
> > > Why not just have the compiler add the fences?<br>
> ><br>
><br>
> > > LLVM could do this, either as a per-backend thing or a hookable<br>
> > > pass<br>
> > > such as AtomicExpandPass. It seems more natural to ask the<br>
> > > programmer to express intent, just as is done with atomics. In<br>
> > > fact, a backend is currently free to ignore !nontemporal on load and<br>
> > > store<br>
> > > and could therefore generate only half of what's requested,<br>
> > > leading<br>
> > > to incorrect code. That would of course be silly, backends should<br>
> > > either honor all !nontemporal or none of them but who knows what<br>
> > > the<br>
> > > middle-end does.<br>
> ><br>
><br>
> > > Put another way: some optimized C libraries use non-temporal<br>
> > > accesses (when string instructions aren't du jour) and they<br>
> > > terminate their copying with an sfence. It's a de-facto<br>
> > > convention: the ABI doesn't say anything, but let's avoid<br>
> > > divergence.<br>
> ><br>
><br>
> > > Aside: one day we may live in the fence elimination promised land<br>
> > > where fences are exactly where they need to be, no more, no less.<br>
> ><br>
><br>
> > > Isn't x86's lfence just a no-op?<br>
> ><br>
><br>
> > > Yes, but we're proposing the addition of a target-independent<br>
> > > non-temporal load barrier. It'll be up to the x86 backend to make<br>
> > > it<br>
> > > an X86ISD::MEMBARRIER and other backends to get it right (hint:<br>
> > > it's<br>
> > > not always a no-op).<br>
> ><br>
><br>
> > > Won't this optimization cause coherency misses? C++ accesses the<br>
> > > thread stack concurrently all the time!<br>
> ><br>
><br>
> > > Maybe, but then it isn't much of an optimization if it's slowing<br>
> > > code<br>
> > > down. LLVM doesn't just target C++, and it's really up to the<br>
> > > backend to decide whether one fence type is better than another<br>
> > > (on<br>
> > > x86, whether a locked top-of-stack idempotent operation is better<br>
> > > than mfence). Other languages have private stacks where this<br>
> > > isn't<br>
> > > an issue, and where the stack top can reasonably be assumed to be<br>
> > > in<br>
> > > cache.<br>
> ><br>
><br>
> > > How will this affect non-user-mode code (i.e. kernel code)?<br>
> ><br>
><br>
> > > Kernel code still has to ask for _mm_mfence if it wants mfence:<br>
> > > C11 and C++11 barriers aren't specified as a specific instruction.<br>
> ><br>
><br>
> > > Is it safe to access top-of-stack?<br>
> ><br>
><br>
> > > AFAIK yes, and the ABI-specified red zone has our back (or front<br>
> > > if<br>
> > > the stack grows up ☻).<br>
> ><br>
><br>
> > > What about non-x86 architectures?<br>
> ><br>
><br>
> > > Architectures such as ARMv8 support non-temporal instructions and<br>
> > > require barriers such as DMB nshld to order loads and DMB nshst<br>
> > > to<br>
> > > order stores.<br>
> ><br>
><br>
> > > Even ARM's address-dependency rule (a.k.a. the ill-fated<br>
> > > std::memory_order_consume) fails to hold with non-temporals:<br>
> ><br>
><br>
> > > > LDR X0, [X3]<br>
> > ><br>
> ><br>
><br>
> > > > LDNP X2, X1, [X0] // X0 may not be loaded when the instruction<br>
> > > > executes!<br>
> > ><br>
> ><br>
> > > Who uses non-temporals anyways?<br>
> ><br>
><br>
> > > That's an awfully personal question!<br>
> ><br>
> > _______________________________________________<br>
> > LLVM Developers mailing list<br>
> > <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
> > <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
><br>
> --<br>
><br>
> --<br>
> Hal Finkel<br>
> Assistant Computational Scientist<br>
> Leadership Computing Facility<br>
> Argonne National Laboratory<br>
><br>
><br>
<br>
--<br>
Hal Finkel<br>
Assistant Computational Scientist<br>
Leadership Computing Facility<br>
Argonne National Laboratory<br>
</div></div></blockquote></div><br></div></div>
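<div>For readers following the thread, here is a minimal sketch of the de-facto convention described in the quoted RFC: non-temporal stores terminated by an sfence before the data is published. It uses the existing x86 intrinsics _mm_stream_si32 and _mm_sfence, not the fence builtins proposed above, which do not exist yet. The helper names streaming_copy and publish are hypothetical, and the non-x86 fallback (plain stores plus a release fence) is an assumption made only so that the sketch compiles everywhere.</div>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32 (SSE2); also pulls in _mm_sfence
#endif

// Hypothetical helper: copy n ints, using non-temporal stores where the
// target supports them, then fence so the copied data is ordered before
// any store that follows.
inline void streaming_copy(int* dst, const int* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
#if defined(__SSE2__)
    for (std::size_t i = 0; i < n; ++i) {
        _mm_stream_si32(dst + i, src[i]);  // non-temporal store (movnti)
    }
    // Per the thread, non-temporal stores on x86 are only ordered by fence
    // instructions, so the copy ends with an explicit sfence.
    _mm_sfence();
#else
    // Assumed fallback for other targets: ordinary stores and a release fence.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
    std::atomic_thread_fence(std::memory_order_release);
#endif
}

// Hypothetical helper: signal a consumer that the copied data is ready.
// The release store alone is not what orders the non-temporal stores on
// x86; the sfence inside streaming_copy does that.
inline void publish(std::atomic<bool>& ready) {
    ready.store(true, std::memory_order_release);
}
```

<div>A caller would run streaming_copy(dst, src, n) and then publish(ready); a consumer that observes ready with an acquire load can then read dst. Keeping the fence inside the copy routine mirrors the library convention mentioned in the RFC, though a user-visible non-temporal fence would let callers batch several copies under one fence.</div>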