[llvm-dev] RFC: non-temporal fencing in LLVM IR

Thu Jan 14 16:05:27 PST 2016

On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <jfb at google.com> wrote:

> On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer <david.majnemer at gmail.com>
> wrote:
>
>>
>>
>> On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <jfb at google.com> wrote:
>>
>>> On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> I agree with Tim's assessment for ARM.  That's interesting; I wasn't
>>>>> previously aware of that instruction.
>>>>>
>>>>> My understanding is that Alpha would have the same problem for normal
>>>>> loads.
>>>>>
>>>>> I'm all in favor of more systematic handling of the fences associated
>>>>> with x86 non-temporal accesses.
>>>>>
>>>>> AFAICT, nontemporal loads and stores seem to have different fencing
>>>>> rules on x86, none of them very clear.  Nontemporal stores should probably
>>>>> ideally use an SFENCE.  Locked instructions seem to be documented to work
>>>>> with MOVNTDQA.  In both cases, there seems to be only empirical evidence as
>>>>> to which side(s) of the nontemporal operations they should go on?
>>>>>
>>>>> I finally decided that I was OK with using a LOCKed top-of-stack
>>>>> update as a fence in Java on x86.  I'm significantly less enthusiastic for
>>>>> C++.  I also think that risks unexpected coherence miss problems, though
>>>>> they would probably be very rare.  But they would be very surprising if
>>>>> they did occur.
>>>>>
>>>>
>>>> Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence
>>>> seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when
>>>> targeting 32-bit x86 machines which do not support mfence.  What
>>>> instruction sequence should we be using instead?
>>>>
>>>
>>> Do they have non-temporal accesses in the ISA?
>>>
>>
>> I thought not but there appear to be instructions like movntps.  mfence
>> was introduced in SSE2 while movntps and sfence were introduced in SSE.
>>
>
> So the new builtin could be sfence? I think the codegen you point out for
> SEQ_CST is fine if we fix the memory model as suggested.
>

I agree that it's fine to use a locked instruction as a seq_cst fence if
MFENCE is not available.  If you have to dirty a cache line, (%esp) seems
like a relatively safe one.  (I'm assuming that CPUID is appreciably slower
and out of the running?  I haven't tried.  But it also probably clobbers
too many registers.)  It's only the idea of writing to a memory location
when MFENCE is available, and could be used instead, that seems
questionable.

What exactly would the non-temporal fences be?  It seems that on x86, the
load and store case may differ.  In theory, there's also a before vs. after
question.  In practice, code using the MOVNT* stores seems to assume that you only need
an SFENCE afterwards.  I can't back that up with spec verbiage.  I don't
know about MOVNTDQA.  What about ARM?
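
To make the store case concrete, here is a rough C sketch of the
"non-temporal stores, then SFENCE afterwards" pattern described above,
using the SSE2 intrinsics _mm_stream_si32 and _mm_sfence.
publish_buffer and its parameters are hypothetical names.  Placing the
fence only after the stores follows the common practice mentioned
above; it is an assumption of this sketch, not something backed by spec
wording.  On non-SSE2 targets the sketch falls back to ordinary stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <xmmintrin.h>  /* _mm_sfence */
#include <emmintrin.h>  /* _mm_stream_si32 (MOVNTI) */
#endif

/*
 * Hypothetical helper: fill buf with non-temporal stores, then set a
 * ready flag that a consumer thread could poll with an acquire load.
 */
static void publish_buffer(int *buf, size_t n, int value, atomic_int *ready)
{
    assert(buf != NULL && ready != NULL);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; ++i)
        _mm_stream_si32(&buf[i], value);  /* weakly ordered streaming store */
    _mm_sfence();  /* assumed sufficient: order the streaming stores before the flag */
#else
    /* Fallback for other targets: plain stores, ordered by the release below. */
    for (size_t i = 0; i < n; ++i)
        buf[i] = value;
#endif
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

Without the _mm_sfence, the release store alone would typically compile
to a plain MOV on x86, which is the gap a non-temporal fence in the IR
would be meant to close.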