[LLVMdev] Atomic Operation and Synchronization Proposal v2
David Greene
dag at cray.com
Thu Jul 12 14:51:38 PDT 2007
On Thursday 12 July 2007 13:08, Chandler Carruth wrote:
> > > Right. For example, the Cray X1 has a much richer set of memory
> > > ordering instructions than anything on the commodity micros:
> > >
> > > http://tinyurl.com/3agjjn
> > >
> > > The memory ordering intrinsics in the current llvm proposal can't take
> > > advantage of them because they are too coarse-grained.
> >
> > I guess the descriptions on that page are, heh, a little terse ;-).
>
> A bit. ;] I was glad to see your clarification.
Yeah, sorry. I was heading out the door to a meeting when I posted the link.
I'm glad Dan clarified some things. Unfortunately, I could only link to
publicly-available documents. Our internal ISA book explains this all much
better. :)
> > The
> > Cray X1 has a dimension of synchronization that isn't covered in this
> > proposal, and that's the set of observers that need to observe the ordering.
> > For example you can synchronize a team of streams in a multi-streaming
> > processor without requiring that the ordering of memory operations be
> > observed by the entire system. That's what motivates most of the variety
> > in that list.
>
> This is fascinating to me, personally. I don't know how reasonable it
> is to implement directly in LLVM, however, could a codegen for the X1
> in theory establish if the "shared memory" was part of a stream in a
> multi-streaming processor, and use those local synchronization
> routines?
Absolutely. The X1 compiler is responsible for partitioning loops to
run on multiple streams and synchronizing among the streams as
necessary. That synchronization is at a level "above" general system
memory ordering. The X1 has multiple levels of parallelism:
- Vectorization
- Decoupled vector/scalar execution (this is where the lsyncs come in)
- Multistreaming (the msync operations)
- Multiprocessing (global machine-wide synchronization via gsync)
The compiler is basically responsible for the first three levels, while the
user handles the fourth via MPI, OpenMP, CAF, UPC, etc. Sometimes the
user inserts directives to help the compiler with levels 1-3, but the
compiler handles a lot of cases on its own automatically.
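To make that division concrete, here is a minimal, purely illustrative sketch; the enum and function names are invented for this message and only the instruction names (lsync, msync, gsync) come from the X1 list above:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names, not proposed IR: one tag per level of
 * parallelism described above. */
enum sync_level {
    LEVEL_VECTOR,       /* vectorization: no explicit software sync */
    LEVEL_DECOUPLED,    /* decoupled vector/scalar execution */
    LEVEL_MULTISTREAM,  /* multistreaming across cores */
    LEVEL_GLOBAL        /* machine-wide multiprocessing */
};

/* Which X1 sync operation covers a given level. */
static const char *x1_sync_for(enum sync_level level)
{
    switch (level) {
    case LEVEL_DECOUPLED:   return "lsync";
    case LEVEL_MULTISTREAM: return "msync";
    case LEVEL_GLOBAL:      return "gsync";
    default:                return "";  /* vector level: hardware handles it */
    }
}
```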
> I'm not sure how reasonable this is. Alternatively, to
> target this specific of an architecture, perhaps the LLVM code could
> be annotated to show where it is operating on streams, versus across
> processors, and allow that to guide the codegen decision as to which
> type of synchronization to utilize. As LLVM doesn't really understand
> the parallel implementation the code is running on, it seems like it
> might be impossible to build this into LLVM without it being
> X1-type-system specific... but perhaps you have better ideas how to do
> such things from working on it for some time?
In a parallelizing compiler, the compiler must keep track of where it placed
data when it parallelized the code, since it must know how to handle
dependencies and insert synchronizations. In the case of the X1, the compiler
partitions a loop to run on multiple cores, so it knows to use msyncs when
that code accesses data shared among the cores. The compiler also determines
which data to share among the cores and which to keep private in each core.
Similarly, when it vectorizes, it knows the dependencies between vector and
scalar operations and inserts the necessary lsyncs.
PGI, Pathscale and Intel, for example, are starting to talk about automatic
OpenMP. They will need to insert synchronizations across cores similarly to
what's done on the X1. Those will probably be some form of MFENCE.
The abstraction here is the "level" of parallelism. Vectorization is very
fine-grained. Most implementations in hardware do not need explicit software
syncs between scalar and vector code.
The next level up is multithreading (we call that multistreaming on the X1
for historical reasons). Depending on architecture, this could happen within
a single core (MTA style) or across multiple cores (X1 style), providing two
distinct levels of parallelism and possibly two distinct sets of sync
instructions in the general case.
Then you've got a level of parallelism around the set of sockets that are
cache coherent (so-called "local" processors, or a "node" in X1 parlance).
You might have another set of sync instructions for this (the X1 does not).
Then you have the most general case of parallelism across "the system"
where communication time between processors is extremely long. This is
the "gsync" level on the X1.
Other more sophisticated architectures may have even more levels of
parallelism.
So in thinking about extending your work (which again, I want to stress is
not immediately necessary, but still good to think about), I would suggest
we think in terms of level of parallelization or perhaps "distance among
participants." It's not a good idea to hard-code things like "vector-scalar
sync" but I can imagine intrinsics that say, "order memory among these
participants," or "order memory at this level," or "order memory between these
levels," where the levels are defined by the target architecture. If a target
doesn't have as many levels as used in the llvm code, then it can just choose
to use a more expensive sync instruction. In X1 terms, a gsync is a really
big hammer, but it can always be used in place of an lsync.
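One way to picture that fallback rule: treat a target's sync instructions as covering an ordered set of levels, and when the level requested by the llvm code isn't available, escalate to the next wider (more expensive) one. A minimal sketch, with all names invented for illustration:

```c
#include <assert.h>

#define NUM_LEVELS 4  /* e.g. decoupled < multistream < node < global */

/* supported[i] != 0 means the target has a sync instruction at level i.
 * Returns the narrowest supported level at least as wide as the
 * requested one, or -1 if none exists. This encodes the rule that a
 * wider sync can always stand in for a narrower one -- in X1 terms,
 * a gsync can always be used in place of an lsync. */
static int select_sync_level(const int supported[NUM_LEVELS], int requested)
{
    for (int i = requested; i < NUM_LEVELS; i++)
        if (supported[i])
            return i;
    return -1;
}
```

A target that implements only levels 0 and 3, say, would lower a level-1 sync request to its level-3 (machine-wide) instruction.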
I don't know if any plans exist to incorporate parallelizing transformations
into llvm, but I can certainly imagine building an auto-parallelizing
infrastructure above it. That infrastructure would have to communicate
information down to llvm so it could generate code properly. How to do that
is another can of worms entirely. :)
> > There's one other specific aspect I'd like to point out here. There's an
> > "acquire" which orders prior *scalar* loads with *all* subsequent memory
> > accesses, and a "release" which orders *all* prior accesses with
> > subsequent *scalar* stores. The Cray X1's interest in distinguishing
> > scalar accesses from vector accesses is specific to its architecture, but
> > in general, it is another case that motivates having more granularity
> > than just "all loads" and "all stores".
>
> This clarifies some of those instructions. Here is my thought on how
> to fit this behavior in with the current proposal:
>
> You're still ordering load-store pairings; there is just the added
> dimensionality of access types. This seems like an easy extension to the
> existing proposal to combine the load and store pairings with a type
> dimension to achieve finer-grained control. Does this make sense as an
> incremental step from your end with much more experience comparing
> your hardware to LLVM's IR?
This would work for X1-style lsyncs, but we should think about whether this is
too architecture-specific. Decoupled execution doesn't fit completely snugly
into the "levels of parallelism" model I outlined above, so it's a bit of an
oddball. It's parallelism, but of a different form. Commodity micros have
decoupled execution, but they handle syncs in hardware (which is why moving
between a GPR and an XMM register is expensive).
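For what it's worth, the load/store-pairing-plus-type idea quoted above can be sketched as a pair of access-class bitmasks per barrier; the X1 "acquire" and "release" described earlier in the thread then fall out as particular mask combinations. Everything here is invented notation, not proposed IR:

```c
#include <assert.h>

/* Each barrier orders the prior accesses matching one mask against
 * the subsequent accesses matching another. */
enum {
    ACC_SCALAR_LOAD  = 1 << 0,
    ACC_SCALAR_STORE = 1 << 1,
    ACC_VECTOR_LOAD  = 1 << 2,
    ACC_VECTOR_STORE = 1 << 3,
    ACC_ALL          = (1 << 4) - 1
};

struct barrier {
    unsigned prior;       /* which earlier accesses are ordered... */
    unsigned subsequent;  /* ...against which later ones */
};

/* X1-style "acquire": prior *scalar* loads vs. *all* later accesses. */
static const struct barrier x1_acquire = { ACC_SCALAR_LOAD, ACC_ALL };

/* X1-style "release": *all* prior accesses vs. later *scalar* stores. */
static const struct barrier x1_release = { ACC_ALL, ACC_SCALAR_STORE };
```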
The X1 fsync falls into the same category. It's there because the X1 does not
have precise traps for floating point code and doesn't really have anything to
do with parallelization. Ditto isync (all modern processors have some form
of this to guard against self-modifying code).
The bottom line is that I don't have easy cut-and-dried answers. I suspect this
will be an organic process and we'll learn how to abstract these things in an
architecture-independent manner as we go.
-Dave