[llvm-dev] RFC: non-temporal fencing in LLVM IR

Wed Jan 13 10:44:48 PST 2016

On Wed, Jan 13, 2016 at 10:32 AM, John Brawn <John.Brawn at arm.com> wrote:

> *What about non-x86 architectures?*
>
>
>
> Architectures such as ARMv8 support non-temporal instructions and require
> barriers such as DMB nshld to order loads and DMB nshst to order stores.
>
>
>
> Even ARM's address-dependency rule (a.k.a. the ill-fated
> std::memory_order_consume) fails to hold with non-temporals:
>
> LDR X0, [X3]
>
> LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
>
>
>
> What exactly do you mean by ‘X0 may not be loaded’ in your example here?
> If you mean that the LDNP could start executing with the value of X0 from
> before the LDR, e.g. initially X0=0x100, the LDR loads X0=0x200 but the
> LDNP uses the old value of X0=0x100, then I don’t think that’s true.
> According to section C3.2.4 of the ARMv8 ARMARM *other* observers may
> observe the LDR and the LDNP in the wrong order, but the CPU executing the
> instructions will observe them in program order.
>

I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
details for that ISA. I lifted this example from here:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html

Which is correct?

> I have no idea if that affects anything in this RFC though.
>

Agreed, but I don't want to be misleading! The current example serves as a
good justification for non-temporal read barriers; it would be a shame to
justify myself on incorrect data :-)

> John
>
>
>
> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] *On Behalf Of *JF
> Bastien via llvm-dev
> *Sent:* 13 January 2016 07:16
> *To:* llvm-dev
> *Cc:* Hans Boehm
> *Subject:* [llvm-dev] RFC: non-temporal fencing in LLVM IR
>
>
>
> Hello, fencing enthusiasts!
>
>
>
> *TL;DR:* We'd like to propose an addition to the LLVM memory model
> requiring non-temporal accesses be surrounded by non-temporal load barriers
> and non-temporal store barriers, and we'd like to add such orderings to the
> fence IR opcode.
>
>
>
> We are open to different approaches, hence this email instead of a patch.
>
>
>
>
>
> *Who's "we"?*
>
>
>
> Philip Reames brought this to my attention, and we've had numerous
> discussions with Hans Boehm on the topic. Any mistakes below are my own,
> all the clever bits are theirs.
>
>
>
>
>
> *Why?*
>
>
>
> Ignore non-temporals for a moment: on most x86 targets LLVM generates an
> mfence for seq_cst atomic fencing. One could instead use a locked
> idempotent atomic access to top-of-stack such as lock or4i [RSP-8] 0.
> Philip has measured this as equivalent on micro-benchmarks, but as ~25%
> faster in macro-benchmarks (other codebases confirm this). There's one
> problem with this approach: non-temporal accesses on x86 are only ordered
> by fence instructions! This means that code using non-temporal accesses
> can't rely on LLVM's fence opcode to do the right thing; it instead
> has to rely on architecture-specific _mm*fence intrinsics.
>
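
A minimal sketch of what relying on the architecture-specific intrinsics looks like today, assuming an x86 target with SSE2. The names publish, try_consume, payload and ready are hypothetical and only serve the illustration. The payload is written with a non-temporal store, so an explicit _mm_sfence is issued before the flag is set instead of trusting the atomic operation to order it. On targets without SSE2 the sketch falls back to a plain store.

```cpp
#include <atomic>
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32, _mm_sfence
#endif

// Hypothetical shared state, for illustration only.
int payload = 0;
std::atomic<bool> ready{false};

// Writes the payload, then sets the flag that tells readers it is available.
void publish(int value) {
#if defined(__SSE2__)
    // Non-temporal store: not ordered by ordinary x86 store ordering.
    _mm_stream_si32(&payload, value);
    // Explicit store fence so the payload is visible before the flag.
    _mm_sfence();
#else
    // Assumption: no SSE2 here, so use a regular store instead.
    payload = value;
#endif
    ready.store(true, std::memory_order_release);
}

// Returns true and fills `out` once the flag has been observed.
bool try_consume(int& out) {
    if (!ready.load(std::memory_order_acquire)) return false;
    out = payload;
    return true;
}
```

The _mm_sfence call is the part a target-independent non-temporal store barrier would stand in for.
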
>
>
>
>
> *But wait! Who said developers need to issue any type of fence when using
> non-temporals?*
>
>
>
> Well, the LLVM memory model sure didn't. The x86 memory model does (volume
> 3 section 8.2.2 Memory Ordering) but LLVM targets more than x86 and the
> backends are free to ignore the !nontemporal metadata, and AFAICT the x86
> backend doesn't add those fences.
>
>
>
> Therefore even without the above optimization the LLVM language reference
> is incorrect: non-temporals should be bracketed by barriers. This applies
> even without threading! Non-temporal accesses aren't guaranteed to interact
> well with regular accesses, which means that regular loads cannot move
> "down" a non-temporal barrier, and regular stores cannot move "up" a
> non-temporal barrier.
>
>
>
>
>
> *Why not just have the compiler add the fences?*
>
>
>
> LLVM could do this, either as a per-backend thing or a hookable pass such
> as AtomicExpandPass. It seems more natural to ask the programmer to
> express intent, just as is done with atomics. In fact, a backend is currently
> free to ignore !nontemporal on load and store and could therefore
> generate only half of what's requested, leading to incorrect code. That
> would of course be silly: backends should either honor all !nontemporal or
> none of them, but who knows what the middle-end does.
>
>
>
> Put another way: some optimized C libraries use non-temporal accesses (when
> string instructions aren't du jour) and they terminate their copying with
> an sfence. It's a de facto convention; the ABI doesn't say anything, but
> let's avoid divergence.
>
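
The convention described above can be sketched roughly as follows, assuming an x86 target with SSE2. stream_copy is a hypothetical helper, not taken from any particular C library. It copies with non-temporal stores and terminates the copy with a single sfence. Without SSE2 it simply defers to memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32, _mm_sfence
#endif

// Hypothetical helper: copies n ints from src to dst.
void stream_copy(int* dst, const int* src, std::size_t n) {
    assert(n == 0 || (dst != nullptr && src != nullptr));
#if defined(__SSE2__)
    // Non-temporal stores, one 32-bit word at a time.
    for (std::size_t i = 0; i < n; ++i) {
        _mm_stream_si32(dst + i, src[i]);
    }
    // One store fence at the end of the copy, per the de facto convention.
    _mm_sfence();
#else
    // Assumption: no SSE2 here, so use an ordinary copy.
    std::memcpy(dst, src, n * sizeof(int));
#endif
}
```

A real library routine would use wider stores and handle alignment; the point here is only the trailing fence.
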
>
>
> Aside: one day we may live in the fence elimination promised land
> <http://lists.llvm.org/pipermail/llvm-dev/2014-September/076701.html> where
> fences are exactly where they need to be, no more, no less.
>
>
>
>
>
> *Isn't x86's lfence just a no-op?*
>
>
>
> Yes, but we're proposing the addition of a target-independent non-temporal
> load barrier. It'll be up to the x86 backend to make it an
> X86ISD::MEMBARRIER and other backends to get it right (hint: it's not
> always a no-op).
>
>
>
>
>
> *Won't this optimization cause coherency misses? C++ accesses the thread
> stack concurrently all the time!*
>
>
>
> Maybe, but then it isn't much of an optimization if it's slowing code
> down. LLVM doesn't just target C++, and it's really up to the backend to
> decide whether one fence type is better than another (on x86, whether a
> locked top-of-stack idempotent operation is better than mfence). Other
> languages have private stacks where this isn't an issue, and where the
> stack top can reasonably be assumed to be in cache.
>
>
>
>
>
> *How will this affect non-user-mode code (i.e. kernel code)?*
>
>
>
> Kernel code still has to ask for _mm_mfence if it wants mfence: C11 and
> C++11 barriers aren't specified as a specific instruction.
>
>
>
>
>
> *Is it safe to access top-of-stack?*
>
>
>
> AFAIK yes, and the ABI-specified red zone has our back (or front if the
> stack grows up ☻).
>
>
>
>
>
> *What about non-x86 architectures?*
>
>
>
> Architectures such as ARMv8 support non-temporal instructions and require
> barriers such as DMB nshld to order loads and DMB nshst to order stores.
>
>
>
> Even ARM's address-dependency rule (a.k.a. the ill-fated
> std::memory_order_consume) fails to hold with non-temporals:
>
> LDR X0, [X3]
>
> LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
>
>
>
>
>
> *Who uses non-temporals anyways?*
>
>
>
> That's an awfully personal question!
>