<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jan 14, 2016 at 4:27 PM, Philip Reames <span dir="ltr">&lt;<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">It's not clear to me this is true if the seq_cst fence is expected
to fence non-temporal stores. I think in practice, you'd be very
unlikely to notice a difference, but I can't point to anything in
the Intel docs which justifies a lock prefixed instruction as
sufficient to fence any non-temporal access. <br></div></blockquote><div><br></div><div>Agreed. I think it's not guaranteed. And the most rational explanation for the fact that LOCK; X is faster than MFENCE seems to be that LOCK only deals with normal write-back cacheable accesses, and hence may not work for cases like this.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF"><span class="">
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>If you have to dirty a cache line, (%esp) seems like
relatively safe one. <br>
</div>
</div>
</div>
</div>
</blockquote></span>
Agreed. As we discussed previously, it is possible to get false sharing
in C++, but this would require one thread to be accessing
information stored in the last frame of another running thread's
stack. That seems sufficiently unlikely to be ignored. <br></div></blockquote><div><br></div><div>I disagree with the reasoning, but not really with the conclusion. Starting a thread with a lambda that captures locals by reference is likely to do this, and is a common C++ idiom, especially in textbook examples. This is aggravated by the fact that I don't understand the hardware prefetcher, and that it sometimes seems to fetch an adjacent line. (Note that C, unlike C++, allows implementations to make thread stacks inaccessible to other threads. Some of us consider that a bug and would refuse to use a general purpose implementation that actually did this. I suspect there are enough of us that it doesn't matter.)</div><div><br></div><div>I think a stronger argument is that the compiler is always allowed to push temporaries on the stack. So this looks exactly as though a sequentially consistent fence required a stack temporary.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF"><span class=""><br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>It's only the idea of writing to a memory location when
MFENCE is available, and could be used instead, that seems
questionable.</div>
</div>
</div>
</div>
</blockquote></span>
While in principle I agree, it appears in practice that this
tradeoff is worthwhile. The hardware doesn't seem to optimize for
the MFENCE case whereas lock prefix instructions appear to be
handled much better.<br></div></blockquote><div>The concern is that it is actually fairly easy to get contention as a result in C++. And programmers might think they know that certain fences shouldn't use temporaries and the rest of their code should run in registers. But I agree this is not a completely clear call. I wish x86 provided a plain fence instruction that handled the common case efficiently, so we could avoid these trade-offs. (A "sequentially consistent store" instruction might be even better, in that it should largely eliminate fences and allow other optimizations.)</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF"><span class="">
</span></div></blockquote></div><br></div><div class="gmail_extra">Hans</div></div>
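<div>For concreteness, here is a minimal sketch of the lambda idiom mentioned above, in which one thread touches another running thread's stack frame. The function name run_worker_on_parent_stack and the loop count are made up for illustration. This only shows the access pattern; whether a fence implemented as a locked write near the parent's stack pointer would actually contend with the worker depends on the frame layout and the hardware, which the sketch does not measure.</div>

```cpp
#include <thread>

// Hypothetical helper (not from the thread above): the parent spawns a
// worker whose lambda captures a stack local by reference, so the worker
// repeatedly writes into the parent's stack frame.
int run_worker_on_parent_stack() {
    int counter = 0;                  // lives in the parent's stack frame

    std::thread worker([&counter] {   // capture by reference
        for (int i = 0; i < 1000; ++i) {
            ++counter;                // worker writes a line of the parent's stack
        }
    });

    // Assumption for illustration: if the parent executed a seq_cst fence
    // here, and that fence were lowered to a locked read-modify-write on
    // its own stack top, the write could land on or near the cache line
    // the worker is updating.

    worker.join();                    // join synchronizes with the worker
    return counter;                   // safe to read after join
}
```

<div>The parent does not touch counter between starting the worker and joining it, so the sketch has no data race; the only point is that the worker's writes target the parent's stack.</div>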