[llvm-dev] RFC: non-temporal fencing in LLVM IR

Tue Jan 12 23:16:24 PST 2016

Hello, fencing enthusiasts!

*TL;DR:* We'd like to propose an addition to the LLVM memory model
requiring non-temporal accesses be surrounded by non-temporal load barriers
and non-temporal store barriers, and we'd like to add such orderings to the
fence IR opcode.

We are open to different approaches, hence this email instead of a patch.

*Who's "we"?*

Philip Reames brought this to my attention, and we've had numerous
discussions with Hans Boehm on the topic. Any mistakes below are my own,
all the clever bits are theirs.

*Why?*

Ignore non-temporals for a moment: on most x86 targets LLVM generates an
mfence for seq_cst atomic fencing. One could instead use a locked
idempotent atomic access to top-of-stack such as lock or4i [RSP-8] 0.
Philip has measured this as equivalent on micro-benchmarks, but as ~25%
faster in macro-benchmarks (other codebases confirm this). There's one
problem with this approach: non-temporal accesses on x86 are only ordered
by fence instructions! This means that code using non-temporal accesses
can't rely on LLVM's fence opcode to do the right thing; it instead has
to rely on architecture-specific _mm*fence intrinsics.
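
To make the problem concrete, here is a minimal C++ sketch of what user
code has to do today. It is an illustration only: nt_fill, publish and the
NT_SKETCH_X86 macro are hypothetical names, and the non-x86 branch simply
falls back to regular stores. On x86 the release store compiles to a plain
mov, which by itself does not order the earlier non-temporal stores, hence
the explicit _mm_sfence.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define NT_SKETCH_X86 1
#else
#define NT_SKETCH_X86 0
#endif

// Hypothetical helper: fill `dst` with `value`, using non-temporal stores
// (movnti) on x86-64 and regular stores elsewhere.
inline void nt_fill(std::int32_t* dst, std::size_t count, std::int32_t value) {
    assert(dst != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
#if NT_SKETCH_X86
        _mm_stream_si32(reinterpret_cast<int*>(dst + i), value);
#else
        dst[i] = value;
#endif
    }
}

// Hypothetical helper: write the payload, then raise a flag for a consumer.
inline void publish(std::int32_t* payload, std::size_t count,
                    std::int32_t value, std::atomic<bool>& ready) {
    nt_fill(payload, count, value);
#if NT_SKETCH_X86
    // Architecture-specific fence: orders the non-temporal stores above
    // before the flag store below. The release store alone does not.
    _mm_sfence();
#endif
    ready.store(true, std::memory_order_release);
}
```

A consumer would spin on ready with an acquire load before reading the
payload. The point is only that the sfence has to be spelled out with an
intrinsic instead of being expressed through the portable fence.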

*But wait! Who said developers need to issue any type of fence when using
non-temporals?*

Well, the LLVM memory model sure didn't. The x86 memory model does (volume
3 section 8.2.2 Memory Ordering) but LLVM targets more than x86 and the
backends are free to ignore the !nontemporal metadata, and AFAICT the x86
backend doesn't add those fences.

Therefore even without the above optimization the LLVM language reference
is incorrect: non-temporals should be bracketed by barriers. This applies
even without threading! Non-temporal accesses aren't guaranteed to interact
well with regular accesses, which means that regular loads cannot move
"down" a non-temporal barrier, and regular stores cannot move "up" a
non-temporal barrier.

*Why not just have the compiler add the fences?*

LLVM could do this, either as a per-backend thing or a hookable pass such
as AtomicExpandPass. It seems more natural to ask the programmer to express
intent, just as is done with atomics. In fact, a backend is currently free
to ignore !nontemporal on load and store and could therefore generate only
half of what's requested, leading to incorrect code. That would of course
be silly: backends should either honor all !nontemporal or none of them,
but who knows what the middle-end does.

Put another way: some optimized C libraries use non-temporal accesses (when
string instructions aren't du jour) and they terminate their copying with
an sfence. It's a de-facto convention: the ABI doesn't say anything, but
let's avoid divergence.
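
For readers who haven't seen that convention, a rough sketch of such a copy
routine follows. It is not taken from any particular C library: stream_copy
and STREAM_COPY_X86 are made-up names, the 16-byte destination alignment is
an assumption of this sketch, and non-x86 targets just fall through to
memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define STREAM_COPY_X86 1
#else
#define STREAM_COPY_X86 0
#endif

// Hypothetical memcpy-like routine. `dst` must be 16-byte aligned because
// _mm_stream_si128 (movntdq) requires an aligned destination.
inline void* stream_copy(void* dst, const void* src, std::size_t n) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    std::size_t i = 0;
#if STREAM_COPY_X86
    // Bulk of the copy: unaligned 16-byte loads, non-temporal stores.
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        _mm_stream_si128(reinterpret_cast<__m128i*>(d + i), v);
    }
#endif
    // Tail (or the whole buffer on non-x86 targets) uses regular stores.
    std::memcpy(d + i, s + i, n - i);
#if STREAM_COPY_X86
    // The de-facto convention: terminate the copy with an sfence so that
    // callers see it ordered like a regular memcpy.
    _mm_sfence();
#endif
    return dst;
}
```

Callers then treat stream_copy like memcpy and never learn that
non-temporal stores were involved, which is exactly why the trailing fence
lives inside the routine.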

Aside: one day we may live in the fence elimination promised land
<http://lists.llvm.org/pipermail/llvm-dev/2014-September/076701.html> where
fences are exactly where they need to be, no more, no less.

*Isn't x86's lfence just a no-op?*

Yes, but we're proposing the addition of a target-independent non-temporal
load barrier. It'll be up to the x86 backend to make it an
X86ISD::MEMBARRIER and other backends to get it right (hint: it's not
always a no-op).

*Won't this optimization cause coherency misses? C++ accesses the thread
stack concurrently all the time!*

Maybe, but then it isn't much of an optimization if it's slowing code down.
LLVM doesn't just target C++, and it's really up to the backend to decide
whether one fence type is better than another (on x86, whether a locked
top-of-stack idempotent operation is better than mfence). Other languages
have private stacks where this isn't an issue, and where the stack top can
reasonably be assumed to be in cache.
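
As a rough illustration of the two x86 choices, here is a C++ sketch using
GCC/Clang inline assembly. All names (fence_mfence, fence_locked_stack_op,
store_then_check, FENCE_SKETCH_X86) are hypothetical, the inline asm only
mimics what a backend might emit, and other targets fall back to the
portable fence. The compiler doesn't know the asm statement is a fence in
the C++ sense; the "memory" clobber merely stops it from reordering around
it.

```cpp
#include <atomic>
#include <cassert>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define FENCE_SKETCH_X86 1
#else
#define FENCE_SKETCH_X86 0
#endif

// Current lowering on most x86 targets: a real mfence.
inline void fence_mfence() {
#if FENCE_SKETCH_X86
    _mm_mfence();
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Alternative lowering: an idempotent locked OR of zero just below the
// stack pointer. It leaves memory unchanged, but per the discussion above
// it does not order non-temporal accesses.
inline void fence_locked_stack_op() {
#if FENCE_SKETCH_X86
    __asm__ __volatile__("lock; orl $0, -8(%%rsp)" ::: "memory", "cc");
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

using FenceFn = void (*)();

// One half of a Dekker-style handshake: announce ourselves, fence, then
// check whether the other side has announced itself yet.
inline bool store_then_check(std::atomic<int>& mine,
                             const std::atomic<int>& other, FenceFn fence) {
    assert(fence != nullptr);
    mine.store(1, std::memory_order_relaxed);
    fence();  // the store above must be ordered before the load below
    return other.load(std::memory_order_relaxed) == 0;
}
```

Which of the two is faster is precisely the kind of decision that belongs
in the backend, since it depends on whether the stack top is likely to be
in cache and private to the thread.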

*How will this affect non-user-mode code (i.e. kernel code)?*

Kernel code still has to ask for _mm_mfence if it wants mfence: C11 and
C++11 barriers aren't specified as a specific instruction.

*Is it safe to access top-of-stack?*

AFAIK yes, and the ABI-specified red zone has our back (or front if the
stack grows up ☻).

*What about non-x86 architectures?*

Architectures such as ARMv8 support non-temporal instructions and require
barriers such as DMB nshld to order loads and DMB nshst to order stores.

Even ARM's address-dependency rule (a.k.a. the ill-fated
std::memory_order_consume) fails to hold with non-temporals:

LDR X0, [X3]
LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!

*Who uses non-temporals anyways?*

That's an awfully personal question!