[LLVMdev] Predicated Vector Operations

Sat May 11 22:36:18 PDT 2013

On Thu, May 9, 2013 at 4:47 PM, <dag at cray.com> wrote:

> Chandler Carruth <chandlerc at google.com> writes:
>
> > What are the desired memory model semantics for a masked store?
> > Specifically, let me suppose a simplified vector model of <2 x i64> on
> > an i64-word-size platform.
> >
> > masked_store(<42, 42>, Ptr, <true, false>)
> >
> > Does this write to the entire <2 x i64> object stored at Ptr or not?
>
> No.  It writes one element.
>

Is this a statement about all of the existing hardware that supports masked
stores, or about the desired semantics in your mind for the IR model?

>
> > Put another way, consider:
> >
> > thread A:
> > ...
> > masked_store(<42, 42>, Ptr, <true, false>)
> > ...
> >
> > thread B:
> > ...
> > masked_store(<42, 42>, Ptr, <false, true>)
> > ...
> >
> > Assuming there is no specific synchronization relevant to Ptr between
> > these two threads and their masked stores, does this form a data race
> > or not?
>
> It entirely depends on the hardware implementation.

Well, in practice yes, but I'm asking how it should be modeled in the IR.
We aren't constrained to the option of having a *different* masked store
intrinsic for every hardware architecture that supports masked stores, and
it would seem strange to not look for a reasonable target independent
abstraction which we can teach the middle-end optimizers about (even if it
does take the form of intrinsics). Maybe there is no such abstraction? That
in and of itself would be surprising to me.

> In most cases I
> would say yes due to cache coherence issues.  From a purely theoretical
> machine that doesn't have false sharing, there would be no data race.
>

I think you're trying to reason about this from a hardware perspective, and
I'm trying to talk about what the right theoretical memory model is. While
hardware is one constraint on that, there are others as well, so it's
worrisome to talk about both a theoretical machine without false sharing
and cache coherency issues when trying to determine if two operations form
a data race.

>
> Of course this assumes that thread B won't access the element stored by
> thread A and vice versa.
>

If we assume that, we have the conclusion -- there is no data race. The
entire question is whether or not the masked element is notionally "stored"
(but without changing the value afterward), or not.

From Nadav's link, for example, it appears that AVX *does* actually do a
full store of the 256-bit vector, but it does it atomically which precludes
data races.

> > From a memory model perspective, if this does *not* form a data race,
> > that makes this tremendously more complex to implement, analyze, and
> > optimize... I'm somewhat hopeful that the desired semantics are for
> > this to form a data race (and thus require synchronization when
> > occurring in different threads like this).
>
> Most of the time the compiler will not know the mask value and will have
> to be conservative.  As Nadav has pointed out, what constitutes
> "conservative" is entirely context-dependent.

> But I don't understand why defining this as not being a data race would
> complicate things.  I'm assuming the mask values are statically known.
> Can you explain a bit more?
>

If my example would form a data race, then when the optimizer sees such a
non-atomic store, *and doesn't know the mask statically* (as I agree, that
is the common case), it would know that the entire vector store was
independent from stores to the same memory locations in other threads. It
could even model the operation of a masked store as a load from that
address, masking both the loaded vector and the incoming vector
(literally), or-ing (or blending) them together, and storing the result
back out. This is a fairly simple model, and easy to reason about. I would
even suggest that perhaps this is how we should represent it in IR.
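
To make the load/blend/store model above concrete, here is a minimal C++
sketch. It is only an illustration of the semantics, not a proposal for the
actual IR: Vec2, Mask2, and masked_store_rmw are hypothetical names standing
in for <2 x i64>, <2 x i1>, and the masked store. Note that every lane of
memory is both read and written, including the lanes the mask turns off.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for <2 x i64> and <2 x i1>; these are not LLVM types.
using Vec2 = std::array<std::uint64_t, 2>;
using Mask2 = std::array<bool, 2>;

// Masked store modeled as a read-modify-write of the whole vector.
// Masked-off lanes are rewritten with the value that was just loaded.
void masked_store_rmw(const Vec2& value, std::uint64_t* ptr, const Mask2& mask) {
    assert(ptr != nullptr);

    Vec2 old{};
    for (std::size_t i = 0; i < old.size(); ++i)
        old[i] = ptr[i];                           // full-width load

    Vec2 blended{};
    for (std::size_t i = 0; i < blended.size(); ++i)
        blended[i] = mask[i] ? value[i] : old[i];  // blend by mask

    for (std::size_t i = 0; i < blended.size(); ++i)
        ptr[i] = blended[i];                       // full-width store
}
```

Under this model the two-thread example is a data race, because each thread
performs a non-atomic write to both elements at Ptr regardless of its mask.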

However, if my example does not form a data race, then when the optimizer
sees such a non-atomic store, *and doesn't know the mask statically*, it
has to assume that the mask may dynamically preclude storing to a memory
location that is being concurrently accessed. It cannot speculatively load
the vector stored there, and perform an explicit mask and blend and a full
store. It essentially means that if we see a masked store and don't know
the mask, then even if we know the address statically, that doesn't matter
because the mask could effectively index into that address and select a
single element to store to.
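
For contrast, the no-data-race reading corresponds roughly to the sketch
below, where a masked-off lane is never touched at all. As before, this is
an illustration under my own assumptions: Vec2, Mask2, and
masked_store_per_lane are hypothetical names, not an existing intrinsic or
API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for <2 x i64> and <2 x i1>; these are not LLVM types.
using Vec2 = std::array<std::uint64_t, 2>;
using Mask2 = std::array<bool, 2>;

// Masked store modeled as independent, conditional scalar stores.
// A lane whose mask bit is false is neither loaded nor stored.
void masked_store_per_lane(const Vec2& value, std::uint64_t* ptr, const Mask2& mask) {
    assert(ptr != nullptr);

    for (std::size_t i = 0; i < value.size(); ++i) {
        if (mask[i])
            ptr[i] = value[i];  // only enabled lanes access memory
    }
}
```

With these semantics the two threads in the example write disjoint
elements, so there is no race, but the optimizer can no longer rewrite the
operation into the load, blend, and full store form unless it can prove
that the masked-off lanes are not concurrently accessed.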