<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <span dir="ltr"><<a href="mailto:jfb@google.com" target="_blank">jfb@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer <span dir="ltr"><<a href="mailto:david.majnemer@gmail.com" target="_blank">david.majnemer@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span>On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <span dir="ltr"><<a href="mailto:jfb@google.com" target="_blank">jfb@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span>On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span>On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">I agree with Tim's assessment for ARM.  
That's interesting; I wasn't previously aware of that instruction.<div><br></div><div>My understanding is that Alpha would have the same problem for normal loads.<div><br></div><div>I'm all in favor of more systematic handling of the fences associated with x86 non-temporal accesses.</div><div><br></div><div>AFAICT, nontemporal loads and stores seem to have different fencing rules on x86, none of them very clear.  Nontemporal stores should probably ideally use an SFENCE.  Locked instructions seem to be documented to work with MOVNTDQA.  In both cases, there seems to be only empirical evidence as to which side(s) of the nontemporal operations they should go on?</div><div><br></div><div>I finally decided that I was OK with using a LOCKed top-of-stack update as a fence in Java on x86.  I'm significantly less enthusiastic for C++.  I also think that risks unexpected coherence miss problems, though they would probably be very rare.  But they would be very surprising if they did occur.</div></div></div></blockquote><div><br></div></span><div>Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when targeting 32-bit x86 machines which do not support mfence.  What instruction sequence should we be using instead?</div></div></div></div></blockquote><div><br></div></span><div>Do they have non-temporal accesses in the ISA?</div></div></div></div></blockquote><div><br></div></span><div>I thought not but there appear to be instructions like movntps.  mfence was introduced in SSE2 while movntps and sfence were introduced in SSE.</div></div></div></div></blockquote><div><br></div></span><div>So the new builtin could be sfence? I think the codegen you point out for SEQ_CST is fine if we fix the memory model as suggested.</div></div></div></div></blockquote><div><br></div><div>I agree that it's fine to use a locked instruction as a seq_cst fence if MFENCE is not available.  
If you have to dirty a cache line, (%esp) seems like a relatively safe one.  (I'm assuming that CPUID is appreciably slower and out of the running?  I haven't tried.  But it also probably clobbers too many registers.)  It's only the idea of writing to a memory location when MFENCE is available, and could be used instead, that seems questionable.</div><div><br></div><div>What exactly would the non-temporal fences be?  It seems that on x86, the load and store case may differ.  In theory, there's also a before vs. after question.  In practice, code using MOVNTA seems to assume that you only need an SFENCE afterwards.  I can't back that up with spec verbiage.  I don't know about MOVNTDQA.  What about ARM?</div></div></div></div>
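<div><br></div><div>For concreteness, here is a minimal sketch of the "SFENCE afterwards" pattern for non-temporal stores. The function name fill_and_publish and its parameters are made up for illustration. It assumes the SSE2 intrinsic _mm_stream_si32 (MOVNTI) and falls back to ordinary stores on targets without SSE2. It only shows the common usage and does not settle the before vs. after question.</div>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32 (MOVNTI); also provides _mm_sfence
#endif

// Hypothetical helper: fill buf[0..n) with value, then publish via a flag.
void fill_and_publish(int* buf, std::size_t n, int value,
                      std::atomic<bool>& ready) {
  assert(buf != nullptr);
#if defined(__SSE2__)
  // Non-temporal stores are weakly ordered on x86: without a fence they
  // may become visible after the later store to the flag.
  for (std::size_t i = 0; i < n; ++i)
    _mm_stream_si32(buf + i, value);
  // SFENCE placed after the non-temporal stores, before publication.
  _mm_sfence();
#else
  // Assumption: no non-temporal stores on this target; use plain stores.
  for (std::size_t i = 0; i < n; ++i)
    buf[i] = value;
#endif
  // On x86 a release store is an ordinary MOV, so it does not by itself
  // order the preceding non-temporal stores.
  ready.store(true, std::memory_order_release);
}
```

<div>A reader would pair this with an acquire load of the flag before touching buf. Whether non-temporal loads (MOVNTDQA) need anything comparable on the consumer side is the part that remains open above.</div>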