[LLVMdev] Predicated Vector Operations
dag at cray.com
Thu May 9 09:04:39 PDT 2013
Dan Gohman <dan433584 at gmail.com> writes:
> But I don't understand why defining this as not being a data race
> would complicate things. I'm assuming the mask values are
> statically known. Can you explain a bit more?
>
> It's an interesting question for autovectorization, for example.
>
> Thread A:
>   for (i = 0; i < n; ++i)
>     if (i & 1)
>       X[i] = 0;
>
> Thread B:
>   for (i = 0; i < n; ++i)
>     if (!(i & 1))
>       X[i] = 1;
>
> The threads run concurrently without synchronization. As written,
> there is no race.
There is no race *if* the hardware cache coherence says so. :)  There
are false sharing issues here, and different machines have behaved very
differently in the past.
The result entirely depends on the machine's consistency model.
LLVM is a virtual machine, and the IR should define a consistency model.
Everything flows from that. Ideally, I think, we'd define the model such
that there is no race in the scalar code and the compiler is free to
vectorize it. That is a very strict consistency model, so for targets
with more relaxed semantics, LLVM would have to insert synchronization
operations or choose not to vectorize.
Presumably if the scalar code were problematic on a machine with relaxed
consistency, the user would have added synchronization primitives and
vectorization would not be possible.
> Can you vectorize either of these loops? If masked-out elements of a
> predicated store are "in play" for racing, then vectorizing would
> introduce a race. And, it'd be hard for an optimizer to prove that
> this doesn't happen.
Same answer. I don't think scalar vs. vector matters. This is mostly a
cache coherence issue.
There is one twist that our vectorization guy pointed out to me. If,
when vectorizing, threads A and B each read the entire vector, update
the values under mask, and then write the entire vector back, clearly a
data race is introduced. The Cray compiler has switches that let users
balance safety against performance, since a stride-one load and store
is generally much faster than a masked load and store.
So for vectorization, the answer is, "it depends on the target
consistency model and the style of vectorization chosen."
> p.s. Yes, you could also vectorize these with a strided store or a
> scatter, but then it raises a different question, of the memory
> semantics for strided or scatter stores.
And again, the same answer. :)
I'm no vectorization expert, but I believe what I said is correct. :)
-David