<div dir="ltr"><div>Hello, fencing enthusiasts!</div><div><br></div><b>TL;DR:</b> We'd like to propose an addition to the LLVM memory model requiring that non-temporal accesses be surrounded by non-temporal load barriers and non-temporal store barriers, and we'd like to add such orderings to the <font face="monospace, monospace">fence</font> IR opcode.<div><br></div><div>We are open to different approaches, hence this email instead of a patch.<br><div><br></div><div><br></div><div><b>Who's "we"?</b></div><div><br></div><div>Philip Reames brought this to my attention, and we've had numerous discussions with Hans Boehm on the topic. Any mistakes below are my own; all the clever bits are theirs.</div><div><br></div><div><br></div><div><b>Why?</b></div><div><br></div><div>Ignore non-temporals for a moment: on most x86 targets LLVM generates an <font face="monospace, monospace">mfence</font> for <font face="monospace, monospace">seq_cst</font> atomic fencing. One could instead use a locked idempotent atomic access to top-of-stack such as <font face="monospace, monospace">lock or4i [RSP-8] 0</font>. Philip has measured this as equivalent on micro-benchmarks, but as ~25% faster in macro-benchmarks (other codebases confirm this). There's one problem with this approach: non-temporal accesses on x86 are only ordered by fence instructions! This means that code using non-temporal accesses can't rely on LLVM's <font face="monospace, monospace">fence</font> opcode to do the right thing; it instead has to rely on architecture-specific <font face="monospace, monospace">_mm*fence</font> intrinsics.</div><div><br></div><div><br></div><div><b>But wait! Who said developers need to issue any type of fence when using non-temporals?</b></div><div><br></div><div>Well, the LLVM memory model sure didn't. 
The x86 memory model does (volume 3 section 8.2.2 Memory Ordering), but LLVM targets more than x86, the backends are free to ignore the <font face="monospace, monospace">!nontemporal</font> metadata, and AFAICT the x86 backend doesn't add those fences.</div><div><br></div><div>Therefore, even without the above optimization, the LLVM language reference is incorrect: non-temporals should be bracketed by barriers. This applies even without threading! Non-temporal accesses aren't guaranteed to interact well with regular accesses, which means that regular loads cannot move "down" a non-temporal barrier, and regular stores cannot move "up" a non-temporal barrier.</div><div><br></div><div><br></div><div><b>Why not just have the compiler add the fences?</b></div><div><br></div><div>LLVM could do this, either as a per-backend thing or a hookable pass such as <font face="monospace, monospace">AtomicExpandPass</font>. It seems more natural to ask the programmer to express intent, just as is done with atomics. In fact, a backend is currently free to ignore <span style="font-family:monospace,monospace">!nontemporal</span> on load and store and could therefore generate only half of what's requested, leading to incorrect code. That would of course be silly: backends should either honor all <span style="font-family:monospace,monospace">!nontemporal</span> or none of them, but who knows what the middle-end does.</div><div><br></div><div>Put another way: some optimized C libraries use non-temporal accesses (when string instructions aren't du jour) and they terminate their copying with an <font face="monospace, monospace">sfence</font>. 
It's a de-facto convention; the ABI doesn't say anything, but let's avoid divergence.</div><div><br></div><div>Aside: one day we may live in <a href="http://lists.llvm.org/pipermail/llvm-dev/2014-September/076701.html">the fence elimination promised land</a> where fences are exactly where they need to be, no more, no less.<br></div><div><br></div><div><br></div><div><b>Isn't x86's <font face="monospace, monospace">lfence</font> just a no-op?</b></div><div><br></div><div>Yes, but we're proposing the addition of a target-independent non-temporal load barrier. It'll be up to the x86 backend to make it an <font face="monospace, monospace">X86ISD::MEMBARRIER</font> and other backends to get it right (hint: it's not always a no-op).</div><div><br></div><div><br></div><div><b>Won't this optimization cause coherency misses? C++ accesses the thread stack concurrently all the time!</b></div><div><div><br></div><div>Maybe, but then it isn't much of an optimization if it's slowing code down. LLVM doesn't just target C++, and it's really up to the backend to decide whether one fence type is better than another (on x86, whether a locked top-of-stack idempotent operation is better than <font face="monospace, monospace">mfence</font>). Other languages have private stacks where this isn't an issue, and where the stack top can reasonably be assumed to be in cache.</div></div><div><br></div><div><br></div><div><b>How will this affect non-user-mode code (i.e. 
kernel code)?</b></div><div><br></div><div>Kernel code still has to ask for <font face="monospace, monospace">_mm_mfence</font> if it wants <font face="monospace, monospace">mfence</font>: C11 and C++11 barriers aren't specified in terms of a specific instruction.</div><div><br></div><div><br></div><div><b>Is it safe to access top-of-stack?</b></div><div><br></div><div>AFAIK yes, and the ABI-specified red zone has our back (or front if the stack grows up ☻).</div><div><br></div><div><br></div><div><b>What about non-x86 architectures?</b></div><div><br></div><div>Architectures such as ARMv8 support non-temporal instructions and require barriers such as <font face="monospace, monospace">DMB nshld</font> to order loads and <font face="monospace, monospace">DMB nshst</font> to order stores.</div><div><br></div></div>Even ARM's address-dependency rule (a.k.a. the ill-fated <font face="monospace, monospace">std::memory_order_consume</font>) fails to hold with non-temporals:<br><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><div><div><font face="monospace, monospace">LDR X0, [X3]</font></div></div></div><div><div><div><font face="monospace, monospace">LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!</font></div></div></div></blockquote><div><div><br></div><div><br></div><div><b>Who uses non-temporals anyways?</b></div><div><br></div><div>That's an awfully personal question!</div></div></div>