[llvm-dev] RFC: non-temporal fencing in LLVM IR

Fri Jan 15 11:21:20 PST 2016

On Thu, Jan 14, 2016 at 4:27 PM, Philip Reames <listmail at philipreames.com>
wrote:

> It's not clear to me this is true if the seq_cst fence is expected to
> fence non-temporal stores.  I think in practice, you'd be very unlikely to
> notice a difference, but I can't point to anything in the Intel docs which
> justifies a lock prefixed instruction as sufficient to fence any
> non-temporal access.
>

Agreed.  I think it's not guaranteed.  And the most rational explanation
for the fact that LOCK; X is faster than MFENCE seems to be that LOCK only
deals with normal write-back cacheable accesses, and hence may not work for
cases like this.
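
To make the concern concrete, here is a minimal C++ sketch of the defensive pattern for code that issues non-temporal stores: it uses an explicit MFENCE rather than relying on how the compiler lowers a seq_cst fence. The helper name `publish_nt` and the macro `NT_STORE_AVAILABLE` are hypothetical and do not come from the thread; the intrinsics are the standard SSE2 ones from `<emmintrin.h>`. It assumes an x86 target with SSE2 and falls back to a plain store elsewhere.

```cpp
#include <atomic>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32 (MOVNTI), _mm_mfence (MFENCE)
#define NT_STORE_AVAILABLE 1
#endif

// Hypothetical helper (not from the thread): write `value` to `*slot` with a
// non-temporal store, then signal `ready` to a consumer.
inline void publish_nt(int* slot, int value, std::atomic<bool>& ready) {
#ifdef NT_STORE_AVAILABLE
    _mm_stream_si32(slot, value);  // weakly ordered, write-combining store
    _mm_mfence();                  // explicit fence documented to order NT stores
#else
    *slot = value;                 // fallback: plain store on other targets
#endif
    // On x86 a release store is an ordinary MOV, so by itself it would not
    // order the non-temporal store above; the explicit fence does that.
    ready.store(true, std::memory_order_release);
}
```

For store-to-store ordering alone, `_mm_sfence()` should also be enough; MFENCE is used here only because it is the instruction under discussion.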

> If you have to dirty a cache line, (%esp) seems like relatively safe one.
>
> Agreed.  As we discussed previously, it is possible to get false sharing in
> C++, but this would require one thread to be accessing information stored
> in the last frame of another running thread's stack.  That seems
> sufficiently unlikely to be ignored.
>

I disagree with the reasoning, but not really with the conclusion.
Starting a thread with a lambda that captures locals by reference is likely
to do this, and is a common C++ idiom, especially in textbook examples.
This is aggravated by the fact that I don't understand the hardware
prefetcher, and that it sometimes seems to fetch an adjacent line.  (Note
that C, unlike C++, allows implementations to make thread stacks
inaccessible to other threads.  Some of us consider that a bug and would
refuse to use a general purpose implementation that actually did this.  I
suspect there are enough of us that it doesn't matter.)
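
The idiom in question can be sketched as follows. The function name `count_in_child` is hypothetical; the point is only that `counter` and `done` live in the parent's current stack frame while the child thread writes to them, so a fence implemented as a locked write near the top of the parent's stack may touch the same or an adjacent cache line. Whether the parent's fence actually compiles to such a write depends on the compiler and target.

```cpp
#include <atomic>
#include <thread>

// Hypothetical example: the child thread updates locals that live in the
// parent's active stack frame, captured by reference.
inline long count_in_child(long n) {
    std::atomic<long> counter{0};   // lives on the parent's stack
    std::atomic<bool> done{false};  // also on the parent's stack

    std::thread worker([&counter, &done, n] {
        for (long i = 0; i < n; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);  // writes the parent's stack line
        done.store(true, std::memory_order_release);
    });

    // The parent issues seq_cst fences while the child is still writing.
    while (!done.load(std::memory_order_acquire))
        std::atomic_thread_fence(std::memory_order_seq_cst);

    worker.join();
    return counter.load();
}
```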

I think a stronger argument is that the compiler is always allowed to push
temporaries on the stack.  So this looks exactly as though a sequentially
consistent fence required a stack temporary.
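
For reference, the two lowerings being compared look roughly like this when written by hand. This is a sketch assuming x86-64 and GNU-style inline assembly (GCC or Clang); the function names are hypothetical, and the exact stack offset a compiler would pick for the locked instruction is not specified here. On other targets both functions fall back to the standard fence.

```cpp
#include <atomic>

#if defined(__x86_64__) && defined(__GNUC__)
// Full fence via the dedicated instruction.
inline void fence_mfence() {
    __asm__ __volatile__("mfence" ::: "memory");
}

// Full fence for ordinary write-back memory via a locked read-modify-write
// of the word at the top of the stack. OR with 0 leaves the value unchanged,
// but the cache line is still acquired for writing.
inline void fence_locked_stack() {
    __asm__ __volatile__("lock; orl $0, (%%rsp)" ::: "memory", "cc");
}
#else
// Assumption: on other targets, defer to whatever the compiler emits.
inline void fence_mfence() { std::atomic_thread_fence(std::memory_order_seq_cst); }
inline void fence_locked_stack() { std::atomic_thread_fence(std::memory_order_seq_cst); }
#endif
```

The second form behaves like a dummy store to a stack temporary, which is the sense in which the fence "required a stack temporary" above.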

> It's only the idea of writing to a memory location when MFENCE is
> available, and could be used instead, that seems questionable.
>
> While in principle I agree, it appears in practice that this tradeoff is
> worthwhile.  The hardware doesn't seem to optimize for the MFENCE case
> whereas lock prefix instructions appear to be handled much better.
>

The concern is that it is actually fairly easy to get contention as a
result in C++.  And programmers might think they know that certain fences
shouldn't use temporaries and the rest of their code should run in
registers.  But I agree this is not a completely clear call.  I wish x86
provided a plain fence instruction that handled the common case
efficiently, so we could avoid these trade-offs.  (A "sequentially
consistent store" instruction might be even better, in that it should
largely eliminate fences and allow other optimizations.)
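
At the source level, the "sequentially consistent store" already exists in C++ as `store(..., std::memory_order_seq_cst)`; the wish above is for hardware that implements it cheaply. The following store-buffering sketch shows what such stores guarantee without any explicit fence. The names `SbResult`, `store_buffering_once`, and `sb_never_both_zero` are hypothetical. On x86, compilers typically implement the store with XCHG or with MOV followed by MFENCE, but that is an implementation detail, not something the code depends on.

```cpp
#include <atomic>
#include <thread>

struct SbResult { int r1; int r2; };

// One run of the classic store-buffering test using seq_cst stores and loads.
inline SbResult store_buffering_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return {r1, r2};
}

// Sequential consistency forbids the outcome r1 == 0 && r2 == 0.
inline bool sb_never_both_zero(int trials) {
    for (int i = 0; i < trials; ++i) {
        SbResult r = store_buffering_once();
        if (r.r1 == 0 && r.r2 == 0) return false;
    }
    return true;
}
```

With weaker orderings (for example release stores and acquire loads) the both-zero outcome is permitted, which is why this case otherwise needs a full fence between the store and the load.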

Hans