[llvm-dev] RFC: non-temporal fencing in LLVM IR

Fri Jan 15 12:04:11 PST 2016

On Fri, Jan 15, 2016 at 12:15 AM, JF Bastien <jfb at google.com> wrote:
>
> What exactly would the non-temporal fences be?  It seems that on x86, the
>> load and store case may differ.  In theory, there's also a before vs. after
>> question.  In practice code using MOVNTA seems to assume that you only need
>> an SFENCE afterwards.  I can't back that up with spec verbiage.  I don't
>> know about MOVNTDQA.  What about ARM?
>>
>> I'll leave this to JF to answer.  I'm not knowledgeable enough about
>> non-temporals to answer without substantial research first.
>>
>
> I'm proposing two builtins:
> - __builtin_nontemporal_load_fence
> - __builtin_nontemporal_store_fence
>
> If I've got this right, on x86 they would be a nop and an sfence,
> respectively.
>
> They otherwise act as memory code motion barriers unless accesses are
> proven to not alias. I think it may be possible to loosen the rule so they
> act closer to acquire/release (allowing accesses to move into the pair) but
> I'm not convinced that this works for every ISA so I'd err on the side of
> caution (since this can be loosened later).
>
What would the semantics be?  They restore the normal architectural
ordering guarantees relied upon by the synchronization primitives, so that
non-temporal accesses don't need to be considered when implementing
synchronization?

Then I think an SFENCE following x86 non-temporal stores would be correct.
And empirically we don't need anything before a non-temporal store to
order it with respect to earlier normal stores.  But I don't think the
latter conclusion follows from the spec.
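
To make the store case concrete, here is a minimal sketch of the pattern
being discussed: streaming stores, then an SFENCE, then an ordinary release
store that publishes the data.  The helper name publish_nontemporal and the
ready flag are hypothetical and not part of the proposal.  On x86 with SSE2
it uses the existing _mm_stream_si32 and _mm_sfence intrinsics; elsewhere it
falls back to plain stores, so it only illustrates where the fence would sit.

```c
#include <stdatomic.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32 (MOVNTI), _mm_sfence (SFENCE) */
#endif

/* Hypothetical helper, for illustration only: fill buf with value using
 * non-temporal stores where available, then publish it through a release
 * store to *ready. */
static void publish_nontemporal(int *buf, size_t n, int value,
                                atomic_int *ready)
{
#if defined(__SSE2__)
    for (size_t i = 0; i < n; ++i)
        _mm_stream_si32(&buf[i], value);  /* weakly ordered streaming store */
    _mm_sfence();  /* order the streaming stores before the flag store below */
#else
    for (size_t i = 0; i < n; ++i)
        buf[i] = value;                   /* fallback: ordinary stores */
#endif
    /* On x86 this is a plain store; without the SFENCE above, the streaming
     * stores would not be ordered before it. */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

Nothing is placed before the streaming stores here, which matches the
empirical observation above rather than anything I can point to in the spec.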

I looked at the MOVNTDQA non-temporal load documentation again, and I'm
confused.  It sounds like so long as the memory is WB-cacheable, we may be
OK without any fences.  But I can't tell that for sure.  In the WC case, a
LOCKed instruction seems to be documented to work as a fence.

In the ARM LDNP case, things seem to be messy.  I don't think we currently
need fences for C++, since we don't normally use the dependency-based
ordering guarantees.  (Except to prevent out-of-thin-air results, which
don't seem to be precluded by the ARM spec.  Intentional or bug?)  But the
difference does matter when implementing Java final fields or
memory_order_consume.

I'm actually getting a little worried that these things are just too
idiosyncratic to reflect in portable intrinsics.