[LLVMdev] Atomic Operation and Synchronization Proposal v2

Torvald Riegel torvald at se.inf.tu-dresden.de
Fri Jul 13 08:19:18 PDT 2007


On Thursday 12 July 2007 19:48, Chandler Carruth wrote:
> On 7/12/07, Torvald Riegel <torvald at se.inf.tu-dresden.de> wrote:
> > Here are some comments, quotes are from the draft.
> >
> > > an operation based constraint cannot guard other operations
> >
> > I think constraints associated with a particular instruction usually
> > apply to this instruction and previous/subsequent instructions, so this
> > wouldn't be true. This is the case in the atomic_ops model, and also on
> > ia64 I think.
>
> "guard other operations" means guarding operations other than the
> atomic intrinsic functions. Specifically, loads from and stores to
> shared memory are perfectly legal, and inherently atomic. However,
> they might need memory synchronization routines. Tying synchronization
> to only operations would force tying them to nearly every operation.

You can provide intrinsics for load/store (and perhaps vector operations, or 
anything else that goes to "memory"), in the same way you do for CAS etc.
This of course assumes that everything besides memory (as seen from the LLVM 
IR perspective) is thread-private.
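
As an illustration only (spelled with names from the atomic_ops package, not 
the proposed LLVM syntax), orderings attached to the memory operations 
themselves could look roughly like this:

  #include <atomic_ops.h>

  void example(volatile AO_t *flag)
  {
      AO_t v = AO_load_acquire(flag);        /* load, ordered before later accesses   */
      AO_store_release(flag, v + 1);         /* store, ordered after earlier accesses */
      AO_compare_and_swap_full(flag, 0, 1);  /* CAS acting as a full barrier          */
  }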

> memory synchronization semantics. A standalone memory barrier would
> still be required (precisely as it is in hardware), but lowerings
> could use the potentially better optimized versions of the operations.
> I see this as an extension of the existing proposal, not a problem.

I know that we're discussing a detail here, but it could be important. I 
would say that you don't need standalone barriers, because barriers are 
always used to order operations that interact with shared memory. We see 
all operations that interact with memory (loads/stores/...). If we didn't, 
then one could argue that LLVM's view of memory is missing something.
For example, reading from a real-time clock could be such a case: there, you 
might want to make sure that a store is visible before you get a timestamp. 
But in that case, the clock should actually be something like a read-only 
memory location.
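
To make the clock example concrete, a sketch using atomic_ops names (this 
assumes a full-barrier load variant, AO_load_full, is available):

  #include <atomic_ops.h>

  AO_t publish_then_timestamp(volatile AO_t *shared_flag,
                              volatile AO_t *clock_reg)
  {
      AO_store(shared_flag, 1);        /* make the update available            */
      return AO_load_full(clock_reg);  /* full barrier folded into the load:   */
                                       /* the store is visible before the      */
                                       /* clock is read, no standalone membar  */
  }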

> > Regarding swap(): Which uses do you have in mind? To me, support for
> > TAS would be more useful as it is used for spinlocks (and no concurrent
> > algo immediately comes to my mind that uses swap() and couldn't use
> > TAS). The difference is that TAS can have a weaker interface, because
> > it operates on a boolean value only, making it easier to map on
> > different hardware (and TAS is faster than CAS on some architectures I
> > think). For example, for TAS a byte is sufficient, whereas with swap
> > you probably would require that the exchange has machine-word size.
>
> I cannot express this enough. This is *not*, repeat *not* an API.
> Programmers will *not* be using these constructs directly. This is an
> interface to the hardware. So what is more useful to algorithms is not
> the primary consideration. Rather, what the hardware provides for
> algorithms is.

I do not agree at all. Let me explain why.
Assume I want to implement a concurrent algorithm, which essentially 
describes how instructions can be interleaved. Thread-private instructions 
can be ignored, so only consider the ones that are used to communicate 
between threads. These are atomic.
Up to now, the set of operations (or possible invocations of operations) is 
ordered only from the perspective of a single thread (so one can assume that 
a thread sees its own previous updates), but it is not sequentially 
consistent or anything like that. So I then add reordering constraints to 
enforce the orderings that are necessary for correctness. One can certainly 
argue about the interface here, but it will consist of load/store, 
test-and-set, get-and-set, compare-and-set, fetch-and-add/sub/inc/dec (these 
are the ones I have seen used in concurrent algorithms), and others that I 
am missing. The point here is that this is an API.
As a next step, I have to select implementations for the atomic ops in the 
algorithm, and somehow declare the reordering constraints.
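
To make the algorithm-level view concrete, here is a trivial spinlock written 
against such an interface, spelled with atomic_ops names purely for 
illustration; the reordering constraints are attached to the operations 
rather than expressed as separate barriers:

  #include <atomic_ops.h>

  static AO_TS_t lock = AO_TS_INITIALIZER;

  void spin_lock(void)
  {
      /* test-and-set with acquire ordering: the critical section cannot be
         reordered before the lock acquisition */
      while (AO_test_and_set_acquire(&lock) == AO_TS_SET)
          ;  /* spin */
  }

  void spin_unlock(void)
  {
      /* release: earlier writes become visible before the lock is freed */
      AO_CLEAR(&lock);
  }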

On the other side (the hardware), there are instructions that I can use 
and that I cannot use, and some way of specifying barriers. Plus all the 
weird, architecture-specific things: do we need a certain alignment for 
the data that atomic ops operate on? Up to which size can operands be?

Now the question for us is where to put the interface, or how many 
interfaces we need.

What you seem to favour is putting the interface right at the hardware 
level (what the hardware provides). I don't like this because it doesn't 
help me in mapping my algorithmic view to the hardware, and in particular 
because it doesn't weaken the dependency on the architecture my program is 
going to run on.
There is obviously a trade-off between 1) doing the mapping completely at 
the application level and 2) doing it at the LLVM level.

For the latter, you wouldn't need to offer more than a load and a CAS (you 
can't get more universal than that). But if you do that, you're slow, and 
developers will not be able to use LLVM's atomic ops if performance matters 
(and it does most of the time for people doing concurrency work).
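
To illustrate why load plus CAS alone ends up slow: every other primitive 
then has to be emulated with a retry loop. A fetch-and-add sketched this way 
(again with atomic_ops names, for illustration only) becomes a whole loop 
where the hardware often has a single instruction:

  #include <atomic_ops.h>

  AO_t fetch_and_add_via_cas(volatile AO_t *addr, AO_t incr)
  {
      AO_t old;
      do {
          old = AO_load(addr);
      } while (!AO_compare_and_swap_full(addr, old, old + incr));
      return old;  /* on x86, a native "lock xadd" does this in one instruction */
  }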

If you do the mapping at the application level, then you select instructions 
in an architecture-dependent way (in this case, in the front-end or 
preprocessor). But then, what is the use of having LLVM atomic ops? LLVM 
handles inline asm quite well, so I can just use, e.g., the atomic_ops 
package.


So, IMO, everything is about the API. You need to map between the abstract 
view and what the hardware offers somewhere, and you don't want to do that 
at the application level.
So we can't dodge this problem; we need to find a good spot in the 
trade-off. If we don't, then it's going to be of limited use to the people 
who actually need it.

Besides this single-interface view, one could also add several interfaces: 
for example, the one you propose, closer to the hardware, plus a library / 
gcc intrinsics / ... at the application level. If you take this approach, 
then you can assume that the interface close to the application puts serious 
effort into selecting instructions, and that any coarsening abstraction at 
the lower interface will just lead to slowdowns. If it doesn't, then we're 
back at the previous scenario.
This scenario might also cover the case in which a user compiles some sort 
of parallelism language directly to LLVM, but I'm not sure about this.

> > This also shows that you get additional overhead in the codegen if
> > barriers are not associated with an operation: To emit efficient code,
> > codegen would have to check whether membars are next to an operation,
> > and whether they can be merged with the operation. If codegens don't do
> > it, then runtime overhead will be higher due to the unnecessary
> > barriers. This also implies that there is no reordering of las() with
> > other operations until codegen phase, so you would at least have to
> > have some sort of compiler barrier, or require the las() target to be
> volatile (do LLVM's volatile ordering guarantees apply to non-volatile
> > loads/stores also, or just to volatile ones?)
>
> This is not true in any situation for any architecture but Itanium.
> Furthermore, given the specs of GCC's intrinsics, it is not true even
> for Itanium in this specific situation. The GCC specification (such
> that it is) for these intrinsics indicates that they must form a "full
> barrier". Itanium's instruction-based barriers cannot provide this
> functionality, forcing it to fall back on a full fence on either side
> of the instruction. Is this poor code? Yes it is, but it is required
> to match the semantics provided by GCC.

atomic_ops has the following generalization for IA64 CAS_full:
#  define AO_compare_and_swap_full(addr, old, new_val) \
        (AO_nop_full(), AO_compare_and_swap_acquire(addr, old, new_val))
So, only one full barrier and one "built-in" barrier, but that is not the 
point I want to make. The thing I'd like to highlight is that the best 
implementation is not straightforward, and that you will have to combine 
membars with operations on at least some architectures in order to get good 
performance.
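
For comparison, a naive lowering that treats the barriers as standalone would 
look roughly like this (a sketch, again in atomic_ops terms, using the 
unordered CAS variant):

  AO_nop_full();                              /* full fence before     */
  AO_compare_and_swap(addr, old, new_val);    /* plain, unordered CAS  */
  AO_nop_full();                              /* full fence after      */

The combined form above needs only one explicit AO_nop_full because the 
acquire half is folded into the cmpxchg instruction itself.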

> On every other architecture, there _are_ no instruction-based
> barriers. x86 almost has instruction based barriers, as for some chips
> _every_ instruction is a full barrier, but this is not what you are
> aiming for either and is slowly changing in newer x86 implementations.
> For the other architectures, memory barriers _are_ separate
> instructions. x86 has separate instructions for memory barriers.

The x86 case is another example. To be fast, you would need to merge the 
full barriers with the instruction (because the instruction already acts as 
a full barrier...). Doing this in the backends requires quite some effort, 
I'd say. I'm not sure how this will develop in the future, though.
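
As a sketch of what that merging buys (atomic_ops names again; the comment 
only describes what an x86 backend would typically emit):

  #include <atomic_ops.h>

  void increment_counter(volatile AO_t *counter)
  {
      /* On x86 this typically compiles to a single "lock xadd"; the lock
         prefix already acts as a full barrier, so emitting separate fence
         instructions before and after it would be redundant. */
      AO_fetch_and_add1_full(counter);
  }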


Torvald


