<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<br>
<br>
<div class="moz-cite-prefix">On 01/12/2016 11:16 PM, JF Bastien
wrote:<br>
</div>
<blockquote
cite="mid:CABdywOcm1o4m6mb=d962WoiX_389hbMV5o65sXmvo4jn3wan6w@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>Hello, fencing enthusiasts!</div>
<div><br>
</div>
<b>TL;DR:</b> We'd like to propose an addition to the LLVM
memory model requiring non-temporal accesses be surrounded by
non-temporal load barriers and non-temporal store barriers, and
we'd like to add such orderings to the <font face="monospace,
monospace">fence</font> IR opcode.
<div><br>
</div>
<div>We are open to different approaches, hence this email
instead of a patch.<br>
<div><br>
</div>
<div><br>
</div>
<div><b>Who's "we"?</b></div>
<div><br>
</div>
<div>Philip Reames brought this to my attention, and we've had
numerous discussions with Hans Boehm on the topic. Any
mistakes below are my own, all the clever bits are theirs.</div>
<div><br>
</div>
<div><br>
</div>
<div><b>Why?</b></div>
<div><br>
</div>
<div>Ignore non-temporals for a moment: on most x86 targets
LLVM generates an <font face="monospace, monospace">mfence</font>
for <font face="monospace, monospace">seq_cst</font> atomic
fencing. One could instead use a locked idempotent atomic
access to top-of-stack such as <font face="monospace,
monospace">lock or4i [RSP-8] 0</font>. Philip has measured
this as equivalent on micro-benchmarks, but as ~25% faster
in macro-benchmarks (other codebases confirm this). There's
one problem with this approach: non-temporal accesses on x86
are only ordered by fence instructions! This means that code
using non-temporal accesses can't rely on LLVM's <font
face="monospace, monospace">fence</font> opcode to do the
right thing; it instead has to rely on
architecture-specific <font face="monospace, monospace">_mm*fence</font>
intrinsics.</div>
</div>
</div>
</blockquote>
Just for clarity: the proposal to change the implementation of
seq_cst is arguably separate from this proposal. It will go through
normal patch review once the semantics are addressed. Whatever we
end up doing with seq_cst, we currently have a semantic hole in our
specification around non-temporals that needs to be addressed. <br>
<br>
Another approach would be to define the current fences as fencing
non-temporals and introduce new ones that don't. Either approach
is workable. I believe that new fences for non-temporals are the
appropriate choice, given that this would more closely match existing
practice. <br>
<br>
We could also consider upgrading existing bitcode to the stronger
form when it is read, whichever choice we make. That would be the
conservatively correct thing to do for older bitcode, which might be
assuming stronger semantics than our barriers explicitly provided.<br>
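<br>
To make the hole concrete, here is a minimal sketch of what user code
has to do today, assuming an x86-64 target. It is written in Rust only
because the std::arch intrinsics correspond directly to movnti and
sfence; the names stream_fill and publish are made up for illustration,
and nothing here is proposed IR. As far as I understand it, the release
store in publish is a plain mov on x86 and does not order the earlier
streaming stores, so the explicit sfence is what keeps the data ahead
of the flag:<br>
<pre>
```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Fill `buf` with `value` using non-temporal stores, then fence them.
#[cfg(target_arch = "x86_64")]
fn stream_fill(buf: &mut [i32], value: i32) {
    use std::arch::x86_64::{_mm_sfence, _mm_stream_si32};
    // SAFETY: every pointer comes from a valid, aligned, exclusive
    // reference, and SSE2 is part of the x86-64 baseline.
    unsafe {
        for slot in buf.iter_mut() {
            // Weakly ordered streaming store (non-temporal hint).
            _mm_stream_si32(slot as *mut i32, value);
        }
        // Order the streaming stores before any later store.
        _mm_sfence();
    }
}

/// Fallback for other targets (an assumption of this sketch):
/// ordinary stores, which the release store below already orders.
#[cfg(not(target_arch = "x86_64"))]
fn stream_fill(buf: &mut [i32], value: i32) {
    buf.fill(value);
}

/// Write the payload, then publish it through `ready`.
fn publish(buf: &mut [i32], ready: &AtomicBool, value: i32) {
    stream_fill(buf, value);
    // Without the sfence above, this flag could become visible
    // before the streamed data on x86.
    ready.store(true, Ordering::Release);
}
```
</pre>
A consumer that observes ready with an acquire load can then read buf.
The point of the proposal is that the fence inside stream_fill should
be expressible in target-independent IR rather than through a
target-specific intrinsic.<br>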
<blockquote
cite="mid:CABdywOcm1o4m6mb=d962WoiX_389hbMV5o65sXmvo4jn3wan6w@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>
<div><br>
</div>
<div><br>
</div>
<div><b>But wait! Who said developers need to issue any type
of fence when using non-temporals?</b></div>
<div><br>
</div>
<div>Well, the LLVM memory model sure didn't. The x86 memory
model does (volume 3 section 8.2.2 Memory Ordering) but LLVM
targets more than x86 and the backends are free to ignore
the <font face="monospace, monospace">!nontemporal</font>
metadata, and AFAICT the x86 backend doesn't add those
fences.</div>
<div><br>
</div>
<div>Therefore even without the above optimization the LLVM
language reference is incorrect: non-temporals should be
bracketed by barriers. This applies even without threading!
Non-temporal accesses aren't guaranteed to interact well
with regular accesses, which means that regular loads cannot
move "down" a non-temporal barrier, and regular stores
cannot move "up" a non-temporal barrier.</div>
<div><br>
</div>
<div><br>
</div>
<div><b>Why not just have the compiler add the fences?</b></div>
<div><br>
</div>
<div>LLVM could do this, either as a per-backend thing or a
hookable pass such as <font face="monospace, monospace">AtomicExpandPass</font>.
It seems more natural to ask the programmer to express
intent, just as is done with atomics. In fact, a backend is
currently free to ignore <span
style="font-family:monospace,monospace">!nontemporal</span> on
load and store and could therefore generate only half of
what's requested, leading to incorrect code. That would of
course be silly: backends should either honor all <span
style="font-family:monospace,monospace">!nontemporal</span> or
none of them but who knows what the middle-end does.</div>
<div><br>
</div>
<div>Put another way: some optimized C libraries use
non-temporal accesses (when string instructions aren't du
jour) and they terminate their copying with an <font
face="monospace, monospace">sfence</font>. It's a de-facto
convention, the ABI doesn't say anything, but let's avoid
divergence.</div>
<div><br>
</div>
<div>Aside: one day we may live in <a moz-do-not-send="true"
href="http://lists.llvm.org/pipermail/llvm-dev/2014-September/076701.html">the
fence elimination promised land</a> where fences are
exactly where they need to be, no more, no less.<br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><b>Isn't x86's <font face="monospace, monospace">lfence</font>
just a no-op?</b></div>
<div><br>
</div>
<div>Yes, but we're proposing the addition of a
target-independent non-temporal load barrier. It'll be up to
the x86 backend to make it an <font face="monospace,
monospace">X86ISD::MEMBARRIER</font> and other backends to
get it right (hint: it's not always a no-op).</div>
<div><br>
</div>
<div><br>
</div>
<div><b>Won't this optimization cause coherency misses? C++
accesses the thread stack concurrently all the time!</b></div>
<div>
<div><br>
</div>
<div>Maybe, but then it isn't much of an optimization if
it's slowing code down. LLVM doesn't just target C++, and
it's really up to the backend to decide whether one fence
type is better than another (on x86, whether a locked
top-of-stack idempotent operation is better than <font
face="monospace, monospace">mfence</font>). Other
languages have private stacks where this isn't an issue,
and where the stack top can reasonably be assumed to be in
cache.</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div><b>How will this affect non-user-mode code (i.e. kernel
code)?</b></div>
<div><br>
</div>
<div>Kernel code still has to ask for _mm_<font
face="monospace, monospace">mfence</font> if it wants <font
face="monospace, monospace">mfence</font>: C11 and C++11
barriers aren't specified as a specific instruction.</div>
<div><br>
</div>
<div><br>
</div>
<div><b>Is it safe to access top-of-stack?</b></div>
<div><br>
</div>
<div>AFAIK yes, and the ABI-specified red zone has our back
(or front if the stack grows up ☻).</div>
<div><br>
</div>
<div><br>
</div>
<div><b>What about non-x86 architectures?</b></div>
<div><br>
</div>
<div>Architectures such as ARMv8 support non-temporal
instructions and require barriers such as <font
face="monospace, monospace">DMB nshld</font> to order
loads and <font face="monospace, monospace">DMB nshst</font>
to order stores.</div>
<div><br>
</div>
</div>
Even ARM's address-dependency rule (a.k.a. the ill-fated <font
face="monospace, monospace">std::memory_order_consume</font>)
fails to hold with non-temporals:<br>
<blockquote style="margin:0px 0px 0px
40px;border:none;padding:0px">
<div>
<div>
<div><font face="monospace, monospace">LDR X0, [X3]</font></div>
</div>
</div>
<div>
<div>
<div><font face="monospace, monospace">LDNP X2, X1, [X0]
// X0 may not be loaded when the instruction executes!</font></div>
</div>
</div>
</blockquote>
<div>
<div><br>
</div>
<div><br>
</div>
<div><b>Who uses non-temporals anyways?</b></div>
<div><br>
</div>
<div>That's an awfully personal question!</div>
</div>
</div>
</blockquote>
<br>
</body>
</html>