[LLVMdev] Proposal for atomic and synchronization instructions

Torvald Riegel torvald at se.inf.tu-dresden.de
Tue Jul 10 09:21:14 PDT 2007


On Monday 09 July 2007 22:36, Chandler Carruth wrote:
> > > > "While the processor may spin and attempt the atomic operation more
> > > > than once before it is successful, research indicates this is
> > > > extremely uncommon." I don't understand this sentence, what do you
> > > > mean?
> > >
> > > I'm not sure I can pinpoint the paper on which the statement is
> > > based, but I seem to recall something similar in the original LL/SC
> > > papers (Maurice Herlihy, DEC Western Research Labs?). It's a
> > > foundation for lock-free algorithms.
> >
> > Well, the statement says that you often have low contention. But that's
> > something you want, not necessarily something you will get, and it
> > depends on the workload/algorithm. I'm missing the context. Is the
> > actual statement just the obvious one, that you should try to use the
> > atomic instructions offered by your processor instead of blocking
> > algorithms?
>
> LL/SC is not a blocking algorithm. I'm going to be changing some of
> the nomenclature on the page to reflect this, but while it spins, it
> does not actually lock. The idea (as I understand it, and I'm still
> hoping Scott can find the reference he gave me to a DEC paper
> outlining this) is that the spin only occurs if another process does
> something that breaks atomicity for that _particular_ LL/SC pairing.
> Even when the spin occurs, it should only last until that particular
> process gets through the op without interruption. This needs some
> statistical analysis, however, and hopefully the research in the
> literature can be located and referenced.

Okay, now I know what you wanted to say (and, by the way, I am familiar with 
it). But I still wouldn't put that into the proposal, because we don't care 
how the hardware implements it; CAS could be implemented the same way. The 
point is whether you get something as powerful as CAS or LL/SC that is still 
nonblocking (or at least appears nonblocking).
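
To illustrate why the hardware mechanism doesn't matter for the interface: a 
value-returning CAS can be sketched on top of LL/SC roughly as follows. 
load_linked() and store_conditional() are hypothetical stand-ins for the 
hardware primitives (e.g., ldl_l/stl_c on Alpha); this is only a sketch:

    /* Sketch: CAS built from LL/SC (hypothetical primitives). */
    int cas(int *addr, int expected, int newval)
    {
        int old;
        do {
            old = load_linked(addr);       /* start the atomic pairing */
            if (old != expected)
                return old;                /* comparison failed */
        } while (!store_conditional(addr, newval)); /* retry if broken */
        return old;                        /* success: old == expected */
    }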

I'd suggest having a look at the so-called consensus hierarchy (and at why 
TAS/get-and-set is strictly less powerful than CAS).

>
> > > > You probably don't need to require CAS (compare-and-set) to return
> > > > the previous value (I think some architectures don't); it could
> > > > just return a boolean value (success/failure).
> > >
> > > compare and swap?
> >
> > Well, do you need the swap, or is a compare-and-set sufficient most of
> > the time? What do other architectures offer?
>
> All of the architectures offer swap, or have no penalty for swapping.
> This allows much easier algorithm development to my mind, as you have
> the best of both worlds -- success/failure information, and the value
> from memory.

Here is an example of what can happen if you want to support an architecture 
whose CAS or LL/SC only returns a success flag, but not the value (i.e., no 
swap).

1) To return the value, you could return the result of a load performed 
before or after the CAS. But without blocking, you can't (easily?) make that 
load atomic with the CAS itself.

2) Suppose a user wants to use CAS for a simple spin-lock (because you don't 
offer TAS, for example). You then need to return a value instead of the 
boolean success flag, but neither the earlier nor the later load is useful. 
So you have to fabricate a "different" value (e.g., the expected value plus 
one, or something like that).

3) Now the user wants to use CAS for something else, for example, swinging a 
pointer. How do you pick a "different" value then? You can, but the user 
can't rely on the returned value for anything else.
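
For contrast, with just a success flag the spin-lock from point 2 is 
trivial, and no fabricated value is needed. A minimal sketch, where 
cas_flag() is a hypothetical primitive that returns nonzero on success 
(GCC's __sync_bool_compare_and_swap has this shape):

    /* Sketch: spin-lock on top of a flag-returning CAS. */
    void spin_lock(volatile int *lock)
    {
        while (!cas_flag(lock, 0, 1))
            ;              /* spin until we change 0 -> 1 atomically */
    }

    void spin_unlock(volatile int *lock)
    {
        /* a release membar is needed before this store on most hardware */
        *lock = 0;
    }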

As a result, you can't support a value-returning CAS on this hardware 
without blocking. Because of that, I would choose the conservative way of 
returning just the success flag. Most of the concurrent implementations I've 
seen have the load in them anyway (something like 
while(1) { load; set up; if (CAS) break; }).
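
Spelled out for an atomic increment, that pattern looks like this (again 
with the hypothetical flag-returning cas_flag(); just a sketch). Note that 
the load is part of the retry loop anyway, so a CAS that returns only a 
success flag loses little:

    /* Sketch: the usual load / set up / CAS retry loop. */
    int fetch_and_inc(int *counter)
    {
        int old;
        do {
            old = *counter;                 /* load */
        } while (!cas_flag(counter, old, old + 1)); /* set up + CAS */
        return old;                         /* value seen before the inc */
    }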

The overhead you can incur is one extra load if you need to retry. However, 
I don't think that is an issue, either regarding ease of use or 
performance-wise.
You usually won't need a membar for this load, because some value from after 
the CAS is sufficient in most algorithms, and the CAS usually ensures that. 
If the CAS failed and your CAS is a swap, then the new value is already 
either in the cache (so the load is cheap) or in a register. You could try 
to supply subsequent loads from the register, but that might be difficult to 
implement in the code generator. In any case, have you performed 
measurements to test whether the potential additional load actually matters 
performance-wise?

> > What you would want is a model that is (1) easy to use for
> > developers and (2) close to what the hardware offers. L/S membars are
> > easy to use, but some architectures such as Itanium offer
> > different membars with different costs. So if you pick the wrong model
> > and have to use stronger membars (e.g., a full mf fence on Itanium) to
> > implement your model, then you pay for that with decreased performance.
>
> Itanium was the only architecture to offer these semantics, while the
> L/S membars are offered to varying levels of detail on several
> architectures. As Itanium is not yet a fully functional target, it was
> not prioritized.
>
> Moreover, as the only instructions (to my knowledge) on Itanium to
> have memory-synchronization components are cmpxchg and fetchadd, these
> could be implemented correctly when implementing the lowering for the
> instructions in this proposal, while still providing full memory
> barriers when needed outside of the atomic instructions. If there is
> serious demand for building memory semantics into the atomic
> instructions, "acquire" and "release" flags could be used, and
> implementations could handle them appropriately. This doesn't seem to
> annul the need for non-operation-based memory barriers.

At the programming level, a membar is always conceptually associated with 
some operation. You need the ordering at some point in your algorithm (when 
the processor would execute a particular operation). If you didn't need 
ordering constraints for any operation, then you wouldn't need the membar at 
all.
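
As a concrete illustration, consider a consumer waiting on a flag set by a 
producer. load_acquire() and membar() are hypothetical primitives here; the 
point is only that the ordering constraint belongs to one particular load, 
which is what an operation-attached membar expresses directly:

    extern volatile int flag;
    extern int data;

    /* Standalone membar: the barrier sits between the flag read
     * and the data read. */
    int consume_with_fence(void)
    {
        while (flag == 0)
            ;             /* wait for the producer */
        membar();         /* order the flag read before the data read */
        return data;
    }

    /* Operation-attached membar: the same constraint, expressed
     * on the load that needs it. */
    int consume_with_acquire(void)
    {
        while (load_acquire(&flag) == 0)
            ;
        return data;
    }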

> The implementations for the current proposal came from architecture
> manuals, the Linux kernel, and the Apache Portable Runtime. I will
> definitely be looking at the atomic_ops implementations to see if
> there are improvements that can be made, but ultimately that provides
> another model, not a re-usable component, as this must be done
> through codegen at some point.

To me, letting codegen emit the same instructions as the macros do _is_ 
reuse. You might even be able to write a script that generates codegen code 
from the macros...

>
> > Second, I guess there has been some serious effort put into selecting
> > the specific model. So, for example, if you look at some of Hans'
> > published slides etc., there are some arguments in favor of associating
> > membars with specific instructions. Do you know reasons why LLVM
> > shouldn't do this?
>
> My reason for not associating them is that the majority of hardware
> implementations do not associate them. The overriding motive was to
> remain very close to the hardware. Could libraries and intrinsic
> functions in the FE provide these different interfaces to the
> constructs?

It might be more difficult because the membars modify operations and can't 
be easily "added" as new instructions, calls, or intrinsics.

Regarding the frontends, we're back at the language-integration question, I 
guess. If you don't want to change syntax or add modifiers, the best way 
would probably be to provide a function-call interface for CAS, loads, ... 
(either plus membars, or with membars attached as in atomic_ops), and then 
"inline"/lower the code to the to-be-added LLVM IR constructs in the 
frontend, or in an extra pass. You might even get a fallback to library code 
(which requires the lib to be linked, of course) when the codegen doesn't 
support the atomic ops for a certain architecture.
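
Such a function-call interface could look roughly like the following. All 
names are made up for illustration; membars could either be separate calls 
(llvm_membar below) or attached to the operations, as in atomic_ops:

    /* Sketch of a hypothetical function-call interface that a
     * frontend or an extra pass would lower to the new IR
     * constructs, falling back to a library implementation where
     * codegen lacks support. */
    int  llvm_atomic_cas_int(int *addr, int expected, int newval);
    int  llvm_atomic_load_int(int *addr);
    void llvm_atomic_store_int(int *addr, int val);
    void llvm_membar(void);   /* full barrier; finer variants possible */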

>
> > Has anyone looked at the memory models that are being discussed for
> > C/C++? Although there is no consensus yet AFAIK, it would be good for
> > LLVM to stay close.
>
> Not that LLVM should shun C/C++, but those aren't its only languages.
> I think it's better to approach the problem from the hardware than from
> the language. This keeps LLVM an accurate layer for expressing hardware
> operations, and allows languages to translate their constructs to
> use the hardware appropriately.

I don't know of any language besides Java that has an official memory 
model. Are there any?
In all other languages I've seen so far, synchronization is either only via 
critical sections, or they revert to using some C/C++ library (which then 
uses asm code).

> I absolutely agree. This is why every aspect of the current proposal
> came from a hardware representation, combined with the Linux kernel
> representations. We deviated toward the hardware to ensure the ability
> of these intrinsics to fully exploit and expose the hardware
> capabilities while remaining lowerable (if that were a word) across
> all targets.
>
> Does this clarify things some? I am quite open to trying to add support
> for Itanium-style hardware representations if this is a significant
> issue; it was simply not a priority, and not a problem that I understand
> well. (The Linux kernel does not use these semantics as far as I could
> find, but then it may not be the best example.)

I really do thank you for the effort you have put into this. All I want is 
for LLVM to get the best option.

The whole memory-model issue is really difficult, and if you get it wrong, 
you're going to pay. Just look at Java and its memory-model troubles...
For this reason, I'd like to see more discussion about it, about 
alternatives, models, implementations, etc. And the decisions should be 
documented somewhere (e.g., in mailing-list archives...).


torvald


