[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Fri Jun 15 16:20:20 PDT 2018

On Thu, Jun 14, 2018 at 5:28 AM Tim Northover <t.p.northover at gmail.com>
wrote:

> >   * I'd like to see ARM+AArch64+Hexagon move away from the problematic
> >   expansion in IR and to have that code deleted from AtomicExpandPass.
> Are
> >   there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.
>
> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.
>

I think it would be potentially-useful, and difficult, to also add pseudos
for the atomicrmw operations with an immediate operand of the range allowed
by the corresponding arithmetic operation. That doesn't seem like it'd be
difficult. But I rather doubt whether any further refinement beyond that
would be useful in practice.

For example, sure -- ARM has shifted-add-register operations. And, an "add
r2, r2, r1, lsl #2" could potentially be (but isn't currently!) generated
inside the ll/sc loop for something like:
 %shl = shl i32 %incr, 2
 %0 = atomicrmw add i32* %i, i32 %shl seq_cst
...but the alternative is simply to generate an lsl instruction before the
loop. I'm quite skeptical whether the ability to omit the 'lsl' would
really be worth caring about. Plus we aren't even doing it now. :)

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation

So, there's two separate issues --
1. Deterministic failure of a loop even if it's the only thing running. For
example, from register spilling to the same cacheline within the loop. As
you note, that has been observed, but only at -O0.
2. Livelock between two such loops on different CPUs. AFAIK this has not
been reported, but I expect it would be very difficult to observe, because
one CPU will likely receive an OS interrupt eventually, and thus break out
of the livelock situation. This would likely exhibit simply as worse
performance than expected, which I expect would be very difficult to track
down and report.

I'd suspect that generally, most of the violations of the architectural
requirements would only cause #2, not #1. It seems entirely possible to me
that this change may well be good both for correctness and
performance-under-load.

> OTOH that *is* an argument for performance over correctness when you

get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I'd rather look at it from the other direction -- does using a late-lowered
atomicrmw pseudo _actually_ cause any observable performance degradation? I
rather suspect it won't.

Now -- cmpxchg, as mentioned before, is trickier, and I could see that the
inability to jump to a separate failure block may make a noticeable
difference at the moment.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180615/a69dcffa/attachment-0001.html>