[LLVMdev] Predicated Vector Operations

dag at cray.com dag at cray.com
Mon May 13 07:51:42 PDT 2013

Chandler Carruth <chandlerc at google.com> writes:

>     > What are the desired memory model semantics for a masked store?
>     > Specifically, let me suppose a simplified vector model of
>     > <2 x i64> on an i64-word-size platform.
>     >
>     > masked_store(<42, 42>, Ptr, <true, false>)
>     >
>     > Does this write to the entire <2 x i64> object stored at Ptr or
>     > not?
>     No. It writes one element.
> Is this a statement about all of the existing hardware that supports
> masked stores, or about the desired semantics in your mind for the IR
> model?

I made the comment thinking about hardware.  If it were to write all
elements, what would it write?  The old value?  That could be useful in
some cases (the performance issue I mentioned below).  But it also
presents problems for mem2reg/SSA.
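
To make the "writes one element" reading concrete, here is a minimal
scalar sketch in Python.  The function name masked_store, the argument
order, and the list standing in for memory are all illustrative
assumptions, not a proposed intrinsic signature.

```python
def masked_store(memory, base, values, mask):
    """Write values[i] to memory[base + i] only where mask[i] is true.

    Lanes whose mask bit is false are never written, so the old
    contents of those locations are left untouched.
    """
    for lane, (value, enabled) in enumerate(zip(values, mask)):
        if enabled:
            memory[base + lane] = value


# The <2 x i64> object at Ptr, modeled as a two-element list.
memory = [7, 9]
masked_store(memory, 0, [42, 42], [True, False])
print(memory)  # [42, 9]
```

Under this reading the second element is not read or written at all,
which is the property the rest of the discussion turns on.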

This discussion might lead us to want a couple of flavors of masked
load and store.  I'm not sure.

>     > Put another way, consider:
>     >
>     > thread A:
>     > ...
>     > masked_store(<42, 42>, Ptr, <true, false>)
>     > ...
>     >
>     > thread B:
>     > ...
>     > masked_store(<42, 42>, Ptr, <false, true>)
>     > ...
>     >
>     > Assuming there is no specific synchronization relevant to Ptr
>     > between these two threads and their masked stores, does this form
>     > a data race or not?
>     It entirely depends on the hardware implementation.
> Well, in practice yes, but I'm asking how it should be modeled in the
> IR. We aren't constrained to the option of having a *different* masked
> store intrinsic for every hardware architecture that supports masked
> stores, and it would seem strange to not look for a reasonable target
> independent abstraction which we can teach the middle-end optimizers
> about (even if it does take the form of intrinsics). Maybe there is no
> such abstraction? That in and of itself would be surprising to me.

I agree.  We should choose the semantics that gives the optimizer the
most freedom.

> From Nadav's link, for example, it appears that AVX *does* actually do
> a full store of the 256-bit vector, but it does it atomically which
> precludes data races.

By "atomically," you mean that all elements are written before any other
operation is allowed to read from or store to them?

> If my example would form a data race, then when the optimizer sees such
> a non-atomic store, *and doesn't know the mask statically* (as I
> agree, that is the common case), it would know that the entire vector
> store was independent from stores to the same memory locations in
> other threads. It could even model the operation of a masked store as
> a load from that address, masking both the loaded vector and the
> incoming vector (literally), or-ing (or blending) them together, and
> storing the result back out. This is a fairly simple model, and easy
> to reason about. I would even suggest that perhaps this is how we
> should represent it in IR.

Note that from a hardware perspective, the store may or may not cause a
data race depending on alignment and whether the store crosses a cache
line boundary.
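
As a rough illustration of the cache-line part of that condition, the
check below tests whether an access of a given size spans two lines.
The 64-byte line size and the helper name crosses_cache_line are
assumptions for the sketch; real line sizes vary by target, and whether
a line-contained store is actually atomic is up to the hardware.

```python
CACHE_LINE_BYTES = 64  # assumption: a common line size, not universal


def crosses_cache_line(address, size_bytes, line_bytes=CACHE_LINE_BYTES):
    """Return True if [address, address + size_bytes) spans two or more lines."""
    first_line = address // line_bytes
    last_line = (address + size_bytes - 1) // line_bytes
    return first_line != last_line


# A 32-byte (256-bit) vector store at three different addresses.
print(crosses_cache_line(0, 32))   # False: bytes 0..31, one line
print(crosses_cache_line(32, 32))  # False: bytes 32..63, one line
print(crosses_cache_line(48, 32))  # True: bytes 48..79, two lines
```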

> However, if my example does not form a data race, then when the
> optimizer sees such a non-atomic store, *and doesn't know the mask
> statically*, it has to assume that the mask may dynamically preclude
> storing to a memory location that is being concurrently accessed. It
> cannot speculatively load the vector stored there, and perform an
> explicit mask and blend and a full store. It essentially means that if
> we see a masked store and don't know the mask, then even if we know
> the address statically, that doesn't matter because the mask could
> effectively index into that address and select a single element to
> store to.

You want to do a full vector load, update and store.  So do we most of
the time, for performance.  :)

I think you may be getting hung up a bit.  What you describe in this
paragraph isn't a masked store at all.  In fact you explicitly state
it's "a full store."

Given your code snippet, if we assume there is no data race in the
scalar code *and* we assume that vector stores are atomic, then the
compiler has two choices on how to translate the code.  It can be unsafe
and do the fast full load, merge, full store sequence or it can do a
slower hardware masked store.  The Cray compiler will do either depending
on switches or directives given by the user.  It is too difficult to
know statically which is the best choice.  I believe that by default we
do the fast code and the user can turn on switches to generate safe code
and see if that fixes problems.  :)
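
The difference between the two translations can be sketched with a
single hand-written interleaving of the two threads from the example.
This is a sequential Python simulation, not real threading, and every
helper name here (load_vector, blend, store_vector, masked_store) is
hypothetical.

```python
def load_vector(memory, base, width):
    """Full vector load: read every lane."""
    return memory[base:base + width]


def blend(old, new, mask):
    """Take the new value where the mask is true, the old value otherwise."""
    return [n if m else o for o, n, m in zip(old, new, mask)]


def store_vector(memory, base, values):
    """Full vector store: write every lane."""
    memory[base:base + len(values)] = values


def masked_store(memory, base, values, mask):
    """Element-wise masked store: write only the enabled lanes."""
    for lane, (value, enabled) in enumerate(zip(values, mask)):
        if enabled:
            memory[base + lane] = value


# Fast translation: full load, merge, full store.
# Interleaving: A loads, B loads, A stores, B stores.
memory = [0, 0]
a_old = load_vector(memory, 0, 2)
b_old = load_vector(memory, 0, 2)
store_vector(memory, 0, blend(a_old, [42, 42], [True, False]))
store_vector(memory, 0, blend(b_old, [42, 42], [False, True]))
fast_result = list(memory)
print(fast_result)  # [0, 42]: B wrote back the stale lane 0

# Safe translation: each thread touches only its enabled lane.
memory = [0, 0]
masked_store(memory, 0, [42, 42], [True, False])  # thread A
masked_store(memory, 0, [42, 42], [False, True])  # thread B
safe_result = list(memory)
print(safe_result)  # [42, 42]
```

In the fast version thread B writes back the stale value it loaded for
lane 0, so thread A's update is lost.  That is the hazard the safe
switches are there to avoid.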

I think we will need both options in LLVM.  There's just no way to pick
one and always have the right answer.  I think we only need one masked
store intrinsic.  That intrinsic would *not* write to masked-off elements.
If the compiler wants to be entirely safe, it should use that.
Otherwise it should feel free to use the full load/merge/full store
sequence.

I will run this by our vector expert to see what he thinks.

