[llvm-dev] RFC: non-temporal fencing in LLVM IR

Thu Jan 14 16:27:09 PST 2016


On 01/14/2016 04:05 PM, Hans Boehm via llvm-dev wrote:
>
>
> On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <jfb at google.com> wrote:
>
>     On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer
>     <david.majnemer at gmail.com> wrote:
>
>
>
>         On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <jfb at google.com>
>         wrote:
>
>             On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via
>             llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>
>
>                 On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via
>                 llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>                     I agree with Tim's assessment for ARM.  That's
>                     interesting; I wasn't previously aware of that
>                     instruction.
>
>                     My understanding is that Alpha would have the same
>                     problem for normal loads.
>
>                     I'm all in favor of more systematic handling of
>                     the fences associated with x86 non-temporal accesses.
>
>                     AFAICT, nontemporal loads and stores seem to have
>                     different fencing rules on x86, none of them very
>                     clear. Nontemporal stores should probably ideally
>                     use an SFENCE. Locked instructions seem to be
>                     documented to work with MOVNTDQA.  In both cases,
>                     there seems to be only empirical evidence as to
>                     which side(s) of the nontemporal operations they
>                     should go on?
>
>                     I finally decided that I was OK with using a
>                     LOCKed top-of-stack update as a fence in Java on
>                     x86.  I'm significantly less enthusiastic for
>                     C++.  I also think that risks unexpected coherence
>                     miss problems, though they would probably be very
>                     rare. But they would be very surprising if they
>                     did occur.
>
>
>                 Today's LLVM already emits 'lock or %eax, (%esp)' for
>                 'fence
>                 seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST)
>                 when targeting 32-bit x86 machines which do not
>                 support mfence.  What instruction sequence should we
>                 be using instead?
>
>
>             Do they have non-temporal accesses in the ISA?
>
>
>         I thought not but there appear to be instructions
>         like movntps.  mfence was introduced in SSE2 while movntps and
>         sfence were introduced in SSE.
>
>
>     So the new builtin could be sfence? I think the codegen you point
>     out for SEQ_CST is fine if we fix the memory model as suggested.
>
>
> I agree that it's fine to use a locked instruction as a seq_cst fence 
> if MFENCE is not available.
It's not clear to me that this is true if the seq_cst fence is expected to 
fence non-temporal stores.  In practice you'd be very unlikely to notice 
a difference, but I can't point to anything in the Intel docs which 
justifies a lock-prefixed instruction as sufficient to fence any 
non-temporal access.
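
To make the concern concrete, here is a minimal sketch of the pattern in 
question: a non-temporal store followed by an explicit SFENCE before a 
flag is published.  The names payload, ready, publish and consume are 
hypothetical and exist only for this illustration.  The sketch assumes 
the SSE2 intrinsic _mm_stream_si32 and the SSE intrinsic _mm_sfence are 
available, and falls back to an ordinary store on other targets.  It 
does not answer whether a lock-prefixed instruction could stand in for 
the SFENCE; that is exactly the open question.

```cpp
#include <atomic>
#include <cassert>

#if defined(__SSE2__) || defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_NT_STORE 1
#else
#define HAVE_NT_STORE 0
#endif

// Hypothetical shared state, used only for this illustration.
static int payload = 0;
static std::atomic<bool> ready{false};

// Writer: store the payload, then publish the flag.
void publish(int value) {
#if HAVE_NT_STORE
    // Non-temporal (streaming) store; typically compiles to MOVNTI.
    _mm_stream_si32(&payload, value);
    // Explicit store fence after the streaming store, before the flag
    // becomes visible.  This is the conservative choice, not a claim
    // about what the architecture minimally requires.
    _mm_sfence();
#else
    // Assumption: no streaming store on this target, use a plain store.
    payload = value;
#endif
    ready.store(true, std::memory_order_release);
}

// Reader: wait for the flag, then read the payload.
int consume() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the writer has published
    }
    return payload;
}
```

Calling publish(42) on one thread and consume() on another is the 
intended use.  Without the explicit SFENCE, the release store alone is 
an ordinary MOV on x86, which is why the fencing of the streaming store 
has to come from somewhere else.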

> If you have to dirty a cache line, (%esp) seems like a relatively safe one.
Agreed.  As we discussed previously, it is possible to get false sharing in 
C++, but this would require one thread to be accessing information 
stored in the last frame of another running thread's stack.  That seems 
sufficiently unlikely to be ignored.

> (I'm assuming that CPUID is appreciably slower and out of the 
> running?  I haven't tried.  But it also probably clobbers too many 
> registers.)
This is my belief.  I haven't actually tried this experiment, but I've 
seen no reports that CPUID is a good choice here.

> It's only the idea of writing to a memory location when MFENCE is 
> available, and could be used instead, that seems questionable.
While in principle I agree, it appears in practice that this tradeoff is 
worthwhile.  The hardware doesn't seem to optimize for the MFENCE case, 
whereas lock-prefixed instructions appear to be handled much better.
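
For reference, the source-level construct under discussion is the 
seq_cst fence.  The sketch below is a store-buffering style test with 
hypothetical names (x, y and the two functions).  Which instruction the 
fence becomes, MFENCE or a lock-prefixed read-modify-write on the stack, 
is the backend's choice and depends on the target and the compiler 
version; nothing in the source selects one or the other.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared variables for a store-buffering style test.
static std::atomic<int> x{0};
static std::atomic<int> y{0};

// Thread A: store x, full fence, load y.
int store_x_then_load_y() {
    x.store(1, std::memory_order_relaxed);
    // Sequentially consistent fence: orders the store above before the
    // load below.  On x86 the backend picks the actual instruction.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return y.load(std::memory_order_relaxed);
}

// Thread B: the mirror image of thread A.
int store_y_then_load_x() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return x.load(std::memory_order_relaxed);
}
```

When the two functions run concurrently on different threads, the 
fences rule out the outcome where both return 0.  Comparing the 
generated assembly with and without SSE2 enabled is one way to see 
which lowering a given compiler picks.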
>
> What exactly would the non-temporal fences be?  It seems that on x86, 
> the load and store case may differ.  In theory, there's also a before 
> vs. after question.  In practice code using MOVNTA seems to assume 
> that you only need an SFENCE afterwards.  I can't back that up with 
> spec verbiage.  I don't know about MOVNTDQA.  What about ARM?
I'll leave this to JF to answer.  I'm not knowledgeable enough about 
non-temporals to answer without substantial research first.
