[LLVMdev] Predicated Vector Operations
dag at cray.com
Thu May 9 09:04:39 PDT 2013
Dan Gohman <dan433584 at gmail.com> writes:
> But I don't understand why defining this as not being a data race
> would complicate things. I'm assuming the mask values are
> statically known. Can you explain a bit more?
>
> It's an interesting question for autovectorization, for example.
>
> Thread A:
>   for (i = 0; i < n; ++i)
>     if (i & 1)
>       X[i] = 0;
>
> Thread B:
>   for (i = 0; i < n; ++i)
>     if (!(i & 1))
>       X[i] = 1;
>
> The threads run concurrently without synchronization. As written,
> there is no race.
There is no race *if* the hardware cache coherence says so. :)  There
are false sharing issues here, and different machines have behaved very
differently in the past.
The result entirely depends on the machine's consistency model.
LLVM is a virtual machine, and the IR should define a consistency model.
Everything flows from that. Ideally, I think, we'd define the model such
that there is no race in the scalar code and the compiler is free to
vectorize it. That is a very strict consistency model, so for targets
with more relaxed semantics, LLVM would have to insert synchronization
operations or choose not to vectorize.
Presumably if the scalar code were problematic on a machine with relaxed
consistency, the user would have added synchronization primitives and
vectorization would not be possible.
> Can you vectorize either of these loops? If masked-out elements of a
> predicated store are "in play" for racing, then vectorizing would
> introduce a race. And, it'd be hard for an optimizer to prove that
> this doesn't happen.
Same answer. I don't think scalar vs. vector matters. This is mostly a
cache coherence issue.
There is one twist that our vectorization guy pointed out to me. If,
when vectorizing, threads A and B each read the entire vector, update
the values under mask, and then write the entire vector back, clearly a
data race is introduced. The Cray compiler has switches that let users
balance safety against performance, since a stride-one load and store
is generally much faster than a masked load and store.
So for vectorization, the answer is, "it depends on the target
consistency model and the style of vectorization chosen."
> p.s. Yes, you could also vectorize these with a strided store or a
> scatter, but then it raises a different question, of the memory
> semantics for strided or scatter stores.
And again, the same answer. :)
I'm no vectorization expert, but I believe what I said is correct. :)
-David