[LLVMdev] Predicated Vector Operations

Thu May 9 08:36:56 PDT 2013

On Thu, May 9, 2013 at 7:47 AM, <dag at cray.com> wrote:

> Chandler Carruth <chandlerc at google.com> writes:
>
> > What are the desired memory model semantics for a masked store?
> > Specifically, let me suppose a simplified vector model of <2 x i64> on
> > an i64-word-size platform.
> >
> > masked_store(<42, 42>, Ptr, <true, false>)
> >
> > Does this write to the entire <2 x i64> object stored at Ptr or not?
>
> No.  It writes one element.
>
> > Put another way, consider:
> >
> > thread A:
> > ...
> > masked_store(<42, 42>, Ptr, <true, false>)
> > ...
> >
> > thread B:
> > ...
> > masked_store(<42, 42>, Ptr, <false, true>)
> > ...
> >
> > Assuming there is no specific synchronization relevant to Ptr between
> > these two threads and their masked stores, does this form a data race
> > or not?
>
> It entirely depends on the hardware implementation.  In most cases I
> would say yes due to cache coherence issues.  From a purely theoretical
> machine that doesn't have false sharing, there would be no data race.
>
> Of course this assumes that thread B won't access the element stored by
> thread A and vice versa.
>
> > From a memory model perspective, if this does *not* form a data race,
> > that makes this tremendously more complex to implement, analyze, and
> > optimize... I'm somewhat hopeful that the desired semantics are for
> > this to form a data race (and thus require synchronization when
> > occurring in different threads like this).
>
> Most of the time the compiler will not know the mask value and will have
> to be conservative.  As Nadav has pointed out, what constitutes
> "conservative" is entirely context-dependent.
>
> But I don't understand why defining this as not being a data race would
> complicate things.  I'm assuming the mask values are statically known.
> Can you explain a bit more?
>

It's an interesting question for autovectorization, for example.

Thread A:
   for (i=0;i<n;++i)
      if (i&1)
        X[i] = 0;

Thread B:
   for (i=0;i<n;++i)
      if (!(i&1))
        X[i] = 1;

The threads run concurrently without synchronization. As written, there is
no race. Can you vectorize either of these loops? If masked-out elements of
a predicated store are "in play" for racing, then vectorizing would
introduce a race. And, it'd be hard for an optimizer to prove that this
doesn't happen.
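
To make the concern concrete, here is a rough sketch in C of two ways a 2-lane masked store might be lowered. None of this is an existing LLVM API: the function names, the LANES constant, and the initial value 7 are made up for illustration, and the "threads" are replayed by hand in a single thread so the result is deterministic. The first lowering touches only the enabled lanes. The second is a load/blend/store of the whole vector, which rewrites the masked-out lane with whatever value it loaded earlier.

```c
#include <stdbool.h>
#include <stdint.h>

#define LANES 2 /* hypothetical vector width, matching the <2 x i64> example */

/* Lowering 1: store only the enabled lanes; masked-out lanes are never written. */
void masked_store_per_lane(const int64_t *val, int64_t *ptr, const bool *mask) {
    for (int i = 0; i < LANES; ++i)
        if (mask[i])
            ptr[i] = val[i];
}

/* Lowering 2, phase 1: load the whole vector currently in memory. */
void blend_load(int64_t *tmp, const int64_t *ptr) {
    for (int i = 0; i < LANES; ++i)
        tmp[i] = ptr[i];
}

/* Lowering 2, phase 2: store the whole vector back, taking the new value in
   enabled lanes and the previously loaded value in masked-out lanes. */
void blend_store(const int64_t *tmp, const int64_t *val, int64_t *ptr,
                 const bool *mask) {
    for (int i = 0; i < LANES; ++i)
        ptr[i] = mask[i] ? val[i] : tmp[i];
}

/* Replays one interleaving by hand: A loads, B stores lane 1, A stores.
   Returns X[1], the lane that only B is supposed to write. */
int64_t replay_blend_interleaving(void) {
    int64_t X[LANES] = {7, 7};
    const int64_t a_val[LANES] = {0, 0}, b_val[LANES] = {1, 1};
    const bool a_mask[LANES] = {true, false}, b_mask[LANES] = {false, true};
    int64_t a_tmp[LANES];

    blend_load(a_tmp, X);                    /* A reads {7, 7}              */
    masked_store_per_lane(b_val, X, b_mask); /* B writes lane 1: X = {7, 1} */
    blend_store(a_tmp, a_val, X, a_mask);    /* A writes back stale lane 1  */
    return X[1];
}

/* Same two stores, both using the per-lane lowering. Returns X[1]. */
int64_t replay_per_lane_interleaving(void) {
    int64_t X[LANES] = {7, 7};
    const int64_t a_val[LANES] = {0, 0}, b_val[LANES] = {1, 1};
    const bool a_mask[LANES] = {true, false}, b_mask[LANES] = {false, true};

    masked_store_per_lane(a_val, X, a_mask); /* A writes lane 0 only */
    masked_store_per_lane(b_val, X, b_mask); /* B writes lane 1 only */
    return X[1];
}
```

With the blend lowering, the replay leaves the old value in X[1], so B's store is lost. With the per-lane lowering, B's store survives. This is the sense in which treating masked-out elements as "in play" would turn the race-free scalar loops into racy vector code.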

Dan

p.s. Yes, you could also vectorize these with a strided store or a scatter,
but then it raises a different question, of the memory semantics for
strided or scatter stores.