[LLVMdev] Atomic Operation and Synchronization Proposal v2

Chandler Carruth chandlerc at gmail.com
Thu Jul 12 10:48:50 PDT 2007


On 7/12/07, Torvald Riegel <torvald at se.inf.tu-dresden.de> wrote:
> Here are some comments, quotes are from the draft.
>
> > an operation based constraint cannot guard other operations
>
> I think constraints associated with a particular instruction usually apply
> to this instruction and previous/subsequent instructions, so this wouldn't
> be true. This is the case in the atomic_ops model, and also on ia64 I
> think.

"guard other operations" means guarding operations other than the
atomic intrinsic functions. Specifically, loads from and stores to
shared memory are perfectly legal, and inherently atomic. However,
they might need memory synchronization routines. Tying synchronization
exclusively to operations would force attaching it to nearly every
operation.
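
To make that concrete, here is a minimal sketch of the kind of
publication pattern this supports, with ordinary stores guarded by a
standalone barrier. The function and value names are mine, and I am
assuming the four membarrier flags select load-load, load-store,
store-load, and store-store ordering, in that order:

  define void @publish( i32* %data, i32* %flag ) {
    store i32 42, i32* %data
    ; store-store barrier: the data store completes before the flag store
    call void @llvm.atomic.membarrier( i1 false, i1 false, i1 false, i1 true )
    store i32 1, i32* %flag
    ret void
  }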

All that aside, as I stated in the proposal, if the _hardware_ demands
it, these atomic operations could be extended to provide built-in
memory synchronization semantics. A standalone memory barrier would
still be required (precisely as it is in hardware), but lowerings
could use the potentially better-optimized versions of the operations.
I see this as an extension of the existing proposal, not a problem.
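
To be concrete about the shape such an extension might take (purely a
hypothetical sketch, not part of the current proposal; the name and the
flag parameters are invented for illustration):

  ; hypothetical: a las variant carrying its own barrier semantics, so
  ; targets whose atomic instructions imply ordering can lower it directly
  declare i32 @llvm.atomic.las.barrier( i32* %ptr, i32 %value,
                                        i1 %barrier_before, i1 %barrier_after )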

>
>
> > The single instruction constraints can, at their most flexible, constrain
> > any set of possible pairings of loads from memory and stores to memory
>
> I'm not sure about this, but can we get issues due to "special" kinds of
> data transfers (such as vector stuff, DMA, ...)? Memcpy implementations
> could be one thing to look at.
> This kind of breaks down to how universal you want the memory model to be.

Issues? Perhaps. On the target architectures, these barriers would be
treated conservatively. *All* memory accesses are protected. If the
hardware provides atypical memory access mechanisms, it should also
provide mechanisms for protecting those. These memory barriers would
then be implemented to utilize those. The model proposed is extremely
universal, because it is extremely general. Implementations for
specialized hardware should take a conservative approach. If target
architectures need more specialized memory constraint constructs,
these could again be added incrementally in order to better support
fine-tuning those architectures' performance characteristics.

>
> Regarding swap(): Which uses do you have in mind? To me, support for TAS
> would be more useful as it is used for spinlocks (and no concurrent algo
> immediately comes to my mind that uses swap() and couldn't use TAS). The
> difference is that TAS can have a weaker interface, because it operates on
> a boolean value only, making it easier to map on different hardware (and
> TAS is faster than CAS on some architectures I think). For example, for TAS
> a byte is sufficient, whereas with swap you probably would require that the
> exchange has machine-word size.

I cannot stress this enough. This is *not*, repeat *not*, an API.
Programmers will *not* be using these constructs directly. This is an
interface to the hardware. So what is more useful to algorithms is not
the primary consideration. Rather, what the hardware provides for
algorithms is.

Furthermore, TAS on most architectures is implemented with a swap.
Specifically, the architecture with the closest thing to a "more
efficient" test and set is x86, and the atomic_ops package, which I
did look at, uses "xchgb". This is precisely the lowering x86 would
provide for the LLVM intrinsic function "i8 @llvm.atomic.swap( i8*
%ptr, i8 swap )". Thus it makes perfect sense for LLVM to support a
"swap" intrinsic function in order to implement an API which provides
test-and-set locks. I cannot see the benefit of limiting the LLVM
representation to that of the API when a common abstraction for
implementing that API is available on every architecture targeted.

Lastly, most of the architectures which support size selection at all
support it for all of the instructions necessary to implement these
operations; 'cas' and 'swap' will lower to single-byte instructions on
x86, for example.
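
For illustration, a TAS-style spinlock acquire built on the swap
intrinsic might look like this. A sketch only; the block and value names
are mine, and the trailing barrier uses the flag ordering assumed
earlier:

  define void @spin_acquire( i8* %lock ) {
  entry:
    br label %loop
  loop:
    ; lowers to "xchgb" on x86: atomically store 1, yielding the old value
    %old = call i8 @llvm.atomic.swap( i8* %lock, i8 1 )
    %acquired = icmp eq i8 %old, 0
    br i1 %acquired, label %done, label %loop
  done:
    ; load-load/load-store barrier keeps the critical section from
    ; floating above the lock acquisition
    call void @llvm.atomic.membarrier( i1 true, i1 true, i1 false, i1 false )
    ret void
  }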


> > These implementations assume a very conservative interpretation.
> > result = __sync_fetch_and_add( <ty>* ptr, <ty> value )
> >   call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
> >   %result = call <ty> @llvm.atomic.las( <ty>* %ptr, <ty> %value )
>
> Shouldn't you have a second membar after the las() to be very conservative
> (i.e., if las() is supposed to really be linearizable)? Otherwise, the
> effects of the las() can be reordered with respect to effects of subsequent
> instructions.

You are probably right here. It was very late, and as mentioned, the
GCC spec is extremely ambiguous about the precise semantics of these
intrinsics. I'm going to move to what I think are more appropriate
semantics for the instructions, rather than trusting GCC's descriptions.
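
Concretely, the more conservative lowering would bracket the operation
with full barriers on both sides (a sketch, with <ty> specialized to
i32):

  call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )
  %result = call i32 @llvm.atomic.las( i32* %ptr, i32 %value )
  call void @llvm.atomic.membarrier( i1 true, i1 true, i1 true, i1 true )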


> This also shows that you get additional overhead in the codegen if barriers
> are not associated with an operation: To emit efficient code, codegen would
> have to check whether membars are next to an operation, and whether they
> can be merged with the operation. If codegens don't do it, then runtime
> overhead will be higher due to the unnecessary barriers. This also implies
> that there is no reordering of las() with other operations until the
> codegen phase, so you would at least have to have some sort of compiler
> barrier, or require the las() target to be volatile (do LLVM's volatile
> ordering guarantees apply to non-volatile loads/stores also, or just to
> volatile ones?)

This is not true in any situation for any architecture but Itanium.
Furthermore, given the specs of GCC's intrinsics, it is not true even
for Itanium in this specific situation. The GCC specification (such as
it is) for these intrinsics indicates that they must form a "full
barrier". Itanium's instruction-based barriers cannot provide this
functionality, forcing it to fall back on a full fence on either side
of the instruction. Is this poor code? Yes it is, but it is required
to match the semantics provided by GCC.

On every other architecture, there _are_ no instruction-based
barriers. x86 almost has instruction-based barriers, as for some chips
_every_ instruction is a full barrier, but this is not what you are
aiming for either, and it is slowly changing in newer x86
implementations. On the other architectures, memory barriers _are_
separate instructions, and x86 likewise provides separate instructions
for memory barriers.

The memory barriers, as proposed, will provide all the necessary
barriers against compiler reordering.


> If you use the other approach instead (ordering constraint attached to
> instructions), then you have to support more intrinsics, but you don't need
> this kind of analysis, and you wouldn't need compiler reordering barriers
> assuming that they are implicit for anything that carries any reordering
> constraint.
>
> I would guess that with the second approach (constraints for operations),
> codegen implementations could be actually easier because you can
> concentrate on just one operation with constraints.

As stated above, this assumes that all architectures work this way,
which they simply do not. This is a hardware interface, and as such it
follows the hardware; codegen will not be notably easier or harder in
either of these cases.


> > result = __sync_fetch_and_or( <ty>* ptr, <ty> value )
>
> or/and/... should be added to the list of supported intrinsics at some time.
> x86 has built-in support for that and doesn't need the CAS loop.

It doesn't have built-in support. These are "fetch and *" operations.
x86 has support for an atomic "or" to memory which does not include
any fetch. x86 also has "xadd", which allows fetch-and-add and
fetch-and-sub (through negation) without a compare and swap; however,
neither "or" nor "and" is available in this fashion. Perhaps I am
missing something in the x86 instruction set, but I can't find any
other way to perform these operations. The many non-fetching atomic
operations in x86 are geared more toward dispatching and joining
threads than toward synchronizing contending threads.
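
For reference, the CAS loop such a lowering would use looks roughly
like this. A sketch only; I write the compare-and-swap intrinsic as
@llvm.atomic.cas with a (pointer, compare value, new value) signature
returning the observed value, which is my naming assumption, not the
proposal's text:

  loop:
    %old  = load i32* %ptr
    %new  = or i32 %old, %value
    ; store %new only if *%ptr still holds %old; returns the value seen
    %seen = call i32 @llvm.atomic.cas( i32* %ptr, i32 %old, i32 %new )
    %ok   = icmp eq i32 %seen, %old
    br i1 %ok, label %done, label %loop
  done:
    ; %old holds the fetched (pre-or) value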

-Chandler Carruth

>
>
> Torvald


