[LLVMdev] Proposal for a new LLVM concurrency memory model

David Greene dag at cray.com
Tue Apr 27 08:46:26 PDT 2010


On Monday 26 April 2010 15:53:31 Renato Golin wrote:
> On 26 April 2010 21:09, David Greene <dag at cray.com> wrote:
> > Vector atomics are extremely useful on architectures that support them.
> >
> > I'm not sure we need atomicity across vector elements, so decomposing
> > shouldn't be a problem, but I will have to think about it a bit.
>
> What are the semantics of vectorization across atomic vector
> operations?
>
> Suppose I atomically write in thread 1 and read in thread 2, to a
> vector with 64 elements.  If I do automatic vectorization, it'd
> naively be converted into N operations of 64/N-wide atomic writes and
> reads, but then some blocks might be read on thread 2 before thread 1
> has written them and others after, supposing reads are much faster
> than writes.
>
> I suppose one would have to take great care when doing such
> transformations to keep the same semantics.  For instance, splitting
> into two loops and putting a barrier between them, thus going back to
> the original design.

So I think there are at least two cases here.

The first case is the one you outline: a producer-consumer relationship.
In that case we would have to respect atomicity across vector elements,
so that a read in thread 2 could not see some elements with the updated
value from the write in thread 1 and some elements with the old value.
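
To make that concrete, here is a minimal sketch in plain C++ (not LLVM
IR, and not code from this proposal), where a per-element std::atomic
stands in for a vector atomic that has been decomposed into scalar
operations:

  // A 4-element "vector" stored as four independent atomics.  Each
  // element is atomic on its own, but the vector as a whole is not.
  #include <atomic>
  #include <array>
  #include <cstdio>
  #include <thread>

  std::array<std::atomic<int>, 4> vec = {};   // zero-initialized

  void producer() {
      // Decomposed "vector store" of {1, 1, 1, 1}: four element stores.
      for (auto &e : vec)
          e.store(1, std::memory_order_release);
  }

  void consumer() {
      // Decomposed "vector load": may observe e.g. {1, 1, 0, 0}, i.e.
      // some elements from before the write and some from after it.
      int s[4];
      for (int i = 0; i < 4; ++i)
          s[i] = vec[i].load(std::memory_order_acquire);
      std::printf("%d %d %d %d\n", s[0], s[1], s[2], s[3]);
  }

  int main() {
      std::thread t1(producer), t2(consumer);
      t1.join();
      t2.join();
  }

Nothing prevents the consumer from seeing a half-updated vector here,
which is exactly what a vector atomic that respects atomicity across
elements would forbid.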

The second case is a pure atomic update: a bunch of threads collaborate
to produce a set of values.  A partial reduction, for example.  A bunch
of threads in a loop atomically operate on a vector, for example
computing a vector sum into it via an atomic add.  After this operation
the code does a barrier sync and continues with the next phase.  In
this case there is no producer-consumer relationship within the loop
(everyone is producing/updating), so we don't need to worry about
respecting atomicity across elements.
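
For contrast, a similar sketch of the second case (again plain C++ with
per-element std::atomic, just an illustration and not code from the
proposal):

  // Several threads accumulate into a shared vector with per-element
  // atomic adds, then synchronize before anything reads the result.
  #include <atomic>
  #include <array>
  #include <cstdio>
  #include <thread>
  #include <vector>

  constexpr int kLanes   = 8;
  constexpr int kThreads = 4;

  std::array<std::atomic<long>, kLanes> sum = {};  // zero-initialized

  void worker(int tid) {
      // Each thread contributes its partial values; interleaving
      // across elements doesn't matter as long as each add is atomic.
      for (int lane = 0; lane < kLanes; ++lane)
          sum[lane].fetch_add(tid + lane, std::memory_order_relaxed);
  }

  int main() {
      std::vector<std::thread> pool;
      for (int tid = 0; tid < kThreads; ++tid)
          pool.emplace_back(worker, tid);
      for (auto &t : pool)
          t.join();                 // stands in for the barrier sync

      // Next phase: safe to read the whole vector now.
      for (int lane = 0; lane < kLanes; ++lane)
          std::printf("lane %d: %ld\n", lane, sum[lane].load());
  }

Since no thread reads the vector until after the barrier, element-wise
atomicity of the adds is all that is needed; decomposing a vector
atomic here changes nothing observable.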

My intuition is that the second case is more important (in the sense of
computation time) than the first, but I will have to talk to some people
here who are more familiar with the common codes than I am.  The first
case might be used for boundary updates and that kind of thing, while
the second case is used for the meat of the computation.

It shouldn't be very hard for the compiler to detect the second case.
It's a pretty straightforward pattern.  For everything else it would
have to assume case #1.

So perhaps we want two kinds of vector atomic: one that respects
atomicity across elements and one that doesn't.

Of course this only matters when looking at decomposing vector atomics
into scalars.  I think it is probably a better strategy just to not
generate the vector atomics in the first place if the target doesn't support 
them.  Then we only need one kind: the one that respects atomicity across 
elements.

                               -Dave



