[llvm-dev] Atomic LL/SC loops in llvm

Tue May 10 12:22:32 PDT 2016

So, taking PR25526 as context, and after reading the internet, it seems to me that creating an atomic LL/SC loop at the IR level -- as is the modern way to do it in LLVM -- is effectively impossible to do correctly.  Nor, for that matter, was it correct to create such a loop at isel time, as the implementation prior to r205525 did (and, as some targets still do, not having been updated yet to use IR expansion.)

In summary (see below for more details): to guarantee forward progress in an LL/SC loop, you typically must use only a subset of the instruction-set (excluding loads, stores, floating-point, branches), the entire loop must fit within some amount of contiguous memory, and it must only execute a small number of instructions.

I see no way that those properties can be guaranteed in LLVM, without emitting the entire loop as a fixed blob.

For ARM, there's the newly-added ARMExpandPseudo::ExpandCMP_SWAP code, which basically does that. But, it seems kinda ugly, and a bit horrible to have to write all that code to generate a couple instructions -- and that's only handling cmpxchg, not atomicrmw, which would need similar treatment. And, even with all that, it's still creating multiple basic blocks, prior to basic-block placement, so I think there's no guarantee that the blocks get placed contiguously. (It kinda seems like it'd be easier at this point to just expand to inline-asm...)

Plus, that's used only with -O0 at the moment. So, it generates worse code (which allowed the implementation to be simpler). It can only returns a success/fail value, instead of branching directly to success/fail successors. It doesn't handle the optimizations like skipping the barrier on early cmpxchg failure. Probably more such issues, too.

Anyone got any other ideas for how to do this, without needing to introduce a bunch of per-target code? AtomicExpandPass was nice in that the target only needed to provide a couple of intrinsics, and it generated the rest.

It's always a pain when the abstractions are leaky...

The typical constraints on an LL/SC loop -- details vary between architectures -- are that between the load-linked and store-conditional instructions you must:

  1. Not store to the region of memory ("granule") containing the address you're trying to atomically-modify. The size of the granule might be only the memory addresses being operated on, or the containing cache-line (most common), or maybe even all of memory, or anything in between. Any such store clears the load-linked reservation -- sometimes even if from the same CPU.

  2. Not cause data or instruction cache filling/eviction activity. E.g. with a direct-mapped cache, any load, or a branch to a sufficiently far-away instruction might cause a guaranteed cache-line conflict -- and that might kill the reservation on every loop iteration. With a more typical n-way cache, you'd have more leeway of course.

  3. Not cause any other architectural exceptions/traps that clear the reservation state. So, that typically excludes floating point, memory barriers, etc.

  4. Not take any branches. (This seems the rarest of the constraints; it's possible it's only relevant to Alpha and RISC-V, although maybe others just forgot to mention it as being potentially problematic.)

  5. Execute only a small number of instructions within the loop.

That last restriction seems most odd as a hard constraint, as opposed to just a performance win. It is apparently because a common implementation <http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka8404.html> is for the LL instruction to pull the cache-line local, and invalidate it from all remote caches (as if the LL were actually a write). The subsequent SC will then succeed only if the cacheline has not been invalidated since the LL. With a naive implementation, LL operations on two CPUs executing the same loop might just ping-pong the cacheline back and forth between them forever, without either CPU being able to successfully execute the SC -- by the time each got to it, the cacheline will have already been invalidated by the other CPU. Delaying the remote invalidation request for a little while in this situation is sufficient to prevent that problem. (But how long is a little while?)

If you fail to adhere to the requirements, there's two kinds of issues it might cause. Firstly, it could cause the loop to deterministically never terminate, even when there's nobody else contending on the atomic. E.g. ll <stackaddr>; spill to stack; sc <stackaddr>. On some CPUs, the spill to a nearby stack address will clear the reservation. This issue is what motivated the recent ARM changes for -O0. Or, apparently on some Alpha CPUs, taking a branch deterministically kills the reservation.

More insidiously, with that same example, you can introduce the possibility of live-lock between LL/SC loops of two processors. A store from another processor necessarily clears the reservation of other CPUs (unlike an LL, which does so in some implementations, but not all). So, the problem with indefinite cacheline ping-pong can be introduced by the extra spill in the loop -- even when it appears that it doesn't cause a problem on your processor.

Here's a relevant set of emails <http://yarchive.net/comp/linux/cmpxchg_ll_sc_portability.html> from Linus, also discussing why it's not possible to reliably expose LL/SC intrinsics to C code -- which is the same problems it's not possible to reliably expose them to LLVM IR.

The MIPS IV manual is nicely specific in its documentation of what you can depend upon:

"An LL on one processor must not take action that, by itself, would cause an SC for the same block on another processor to fail." (which avoids the potential of live-lock between two ll/sc loops where one CPU's LL invalidates the other CPU's reservation before the other can get to the SC operation.)

Forward progress can then be guaranteed, unless:
"A load, store, or prefetch is executed on the processor executing the LL/SC." or
"The instructions executed starting with the LL and ending with the SC do not lie in a 2048-byte contiguous region of virtual memory.".
Also: "Exceptions between the LL and SC cause SC to fail, so persistent exceptions must be avoided. Some examples of these are arithmetic operations that trap, system calls, floating-point operations that trap or require software emulation assistance."

Similarly, the RISC-V manual -- which I mention only because it is so clear and concise about its architectural requirements -- says:
"We mandate that LR/SC sequences of bounded length (16 consecutive static instructions) will eventually succeed, provided they contain only base ISA instructions other than loads, stores, and taken branches." (their "base ISA" excludes floating point or multiply/divide instructions.)

Unfortunately, neither ARM nor PPC appear to precisely document the architectural constraints under which forward progress must be guaranteed by the implementation. They certainly have the same underlying implementation issues that give rise to the above rules -- that much seems documented -- they just don't appear to make explicit guarantees on how you can guarantee success. ARM does "recommend" that LL/SC loops fit within 128 bytes, though.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160510/92bd5d02/attachment.html>