[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Sun Jun 17 05:21:56 PDT 2018

On 15 June 2018 at 23:03, James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> However, in order to not block this on every architecture maintainer being
> persuaded, it's a good idea to introduce the new functionality into
> AtomicExpandPass, and switch architectures over to it as the arch
> maintainers are convinced it's a good idea.

Indeed, even if everyone agreed this was a good idea I wasn't
expecting to do this all at once.

>> The above can be achieved by extending AtomicExpandPass to support a
>> 'Custom' expansion method, which uses a TargetLowering call to expand to
>> custom IR, including an appropriate intrinsic representing the LL+SC loop
>
>
> I think it'd be better to sink more of the functionality into
> AtomicExpandPass itself (rather than adding a "Custom" hook). However, that
> ties into whether to introduce a common intrinsic that can be used across
> architectures...

Yes, I'd like to do more in AtomicExpandPass. Adding the 'Custom' hack
was the easiest way of prototyping this, and this thread will
hopefully give good guidance on the level of interest in using this in
a target-independent way.

>> * Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
>> normal instruction selection mechanism. This pseudo-instruction will be
>> expanded after register allocation.
>
>
> On RISCV, implementing the whole thing in the pseudo is probably right,
> since you only really have the inner-loop.
>
> But for other archs like ARMv7, I think it'll probably make sense to
> continue to handle a bunch of the cmpxchg expansion in IR. There, the
> current cmpxchg expansion can be quite complex, but only the loop really
> needs to be a primitive (we'd need two loop variants, either "strex, ldrex,
> loop" or "ldrex, strex, loop", depending on whether it generates an initial
> ldrex+barrier first). All the rest (initial ldrex+barrier, clrex,
> barriers) can remain IR-level expanded.

Good point. As you say, the RISC-V expansion is much more
straightforward. Although the reservation could be cleared eagerly after
compare failure by an SC to a dummy memory location, I don't currently
intend to do so:
1) GCC also doesn't intend to use such an expansion
2) No existing microarchitectural implementations have been shown to
benefit from this manual eager reservation clearing
3) Sticking to the simplest expansion is a good starting point, and
future microarchitects are most likely to optimise for code that is
out in the wild
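
To make the shape of that simplest expansion concrete, here is a rough
C++ sketch rather than anything taken from the prototype. It assumes
std::atomic's compare_exchange_weak as a stand-in for a single LR/SC
attempt (it is allowed to fail spuriously, much as an SC can lose its
reservation), and the helper name cmpxchg_word is purely hypothetical.
There is no reservation clearing on the compare-failure path.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not an LLVM API: a strong word-sized
// compare-and-exchange built from the weak form.
// Returns the value observed in memory; the exchange succeeded if and
// only if the returned value equals `expected`.
inline uint32_t cmpxchg_word(std::atomic<uint32_t>& mem, uint32_t expected,
                             uint32_t desired) {
    uint32_t observed = expected;
    while (!mem.compare_exchange_weak(observed, desired,
                                      std::memory_order_seq_cst)) {
        if (observed != expected)
            return observed;  // genuine compare failure: exit the loop
        // Otherwise the failure was spurious (the analogue of a failed
        // SC), so go round again, as the branch back to the LR would.
    }
    return expected;  // the store succeeded, so memory held `expected`
}
```

Calling cmpxchg_word(mem, 4, 5) on a word holding 3 returns 3 and leaves
the word untouched; calling it with expected 3 stores the new value.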

> I'll note that in all cases, both for RISCV and ARM and others, we _really_
> would like to be able to have the compare-failure jump to a different
> address than success. That isn't possible for an intrinsic call at the
> moment, but I believe it will be possible to make that work soon, due to
> already ongoing work for "asm goto", which requires similar. Once we have
> that ability, I don't see any reason why the late-lowering cmpxchg pseudo
> should have any performance downside vs IR expansion, at least w.r.t. any
> correct optimizations.

I've seen periodically recurring discussions of "asm goto" support, but
is anyone actively working on it?

>> The main aspect I'm unhappy with in this approach is the need to introduce
>> new intrinsics. Ideally these would be documented as not for use by
>> frontends and subject to future removal or alteration - is there precedent
>> for this? Alternatively, see the suggestion below to introduce
>> target-independent masked AMO intrinsics.
>
>
> I agree -- it'd be great to somehow name/annotate these intrinsics, such
> that it's clear that they're a private implementation detail _INSIDE_ the
> llvm target codegen passes, with no stability guaranteed. Not even "opt"
> passes should be emitting them, so they should never end up in a bitcode
> file where we'd need to provide backwards compatibility.
>
> Maybe we can call them something like
> "llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where the
> random number in the middle is something that changes), and document that
> nobody may use intrinsics in the INTERNAL_CODEGEN namespace, other than
> LLVM CodeGen.

llvm.internal_use_only.masked.atomicrmw.add.i32 would get the point
across I think.

Is it not possible someone would generate a .bc after AtomicExpandPass
has run? Of course, even now there's no guarantee such a file would
work on a future version of LLVM, e.g. the atomics lowering strategy
could change from one release to the next.

>> ## Alternative options
>
> [A bunch of alternatives I don't like, so I'm omitting them]
>
>>
>> 5. Introduce target-independent intrinsics for masked atomic operations.
>> This seems worthy of consideration.
>
>
> I think it's definitely worthwhile looking to see if the set of intrinsics
> for the atomicrmw operations (in particular, the set of additional arguments
> computed in the pre-loop section, and the return value) are likely going to
> be the same on the different architectures we have now. If so, I think it's
> definitely worthwhile making a commonly-named intrinsic. However, even so,
> it should remain for internal use only, and only implemented on targets
> which tell AtomicExpandPass to generate it. I'd like it to be common only to
> enhance code-reuse among the targets, not to provide a user-facing API.

Yes, giving the impression of providing a new user-facing API is my
concern. Particularly as we might want to define a comprehensive
set of intrinsics but have targets support only the subset they
require for comprehensive atomics support (as some instructions may
go through the usual SDISel route without transformation to
intrinsics).
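
For readers less familiar with what a masked atomicrmw has to do, the
following C++ sketch shows an 8-bit atomic add performed on the
containing 32-bit word. It is only an illustration under stated
assumptions: the names MaskedArgs, preloop and masked_atomicrmw_add_i8
are hypothetical, the byte position is passed explicitly (little-endian
numbering) instead of being derived from a pointer, and a
compare-exchange loop stands in for the LL/SC loop a backend would emit.
The shift and mask are the kind of values computed in the pre-loop
section.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Values computed once, before the loop (hypothetical struct).
struct MaskedArgs {
    unsigned shift;  // bit offset of the byte within the aligned word
    uint32_t mask;   // selects that byte within the word
};

inline MaskedArgs preloop(unsigned byte_index) {
    assert(byte_index < 4);  // a 32-bit word holds four bytes
    unsigned shift = byte_index * 8;
    return {shift, uint32_t{0xFF} << shift};
}

// Atomically adds `incr` to one byte of `word`, leaving the other three
// bytes untouched. Returns the previous value of that byte.
inline uint8_t masked_atomicrmw_add_i8(std::atomic<uint32_t>& word,
                                       unsigned byte_index, uint8_t incr) {
    MaskedArgs a = preloop(byte_index);
    uint32_t old = word.load(std::memory_order_relaxed);
    uint32_t desired;
    do {
        // Add in place, then mask so a carry cannot leak into the
        // neighbouring byte.
        uint32_t sum = (old + (uint32_t{incr} << a.shift)) & a.mask;
        desired = (old & ~a.mask) | sum;
    } while (!word.compare_exchange_weak(old, desired,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed));
    return static_cast<uint8_t>((old & a.mask) >> a.shift);
}
```

The return value is the old byte, shifted back down, which mirrors the
question of what the intrinsic's return value should be on each target.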

A common set of intrinsics is probably ok, but there are cases where
you might want a different interface. For instance, clearing a
reservation requires an SC to a dummy address on RISC-V. Although we
have no intention of doing this in the RISC-V backend currently,
a target that wanted to implement such an approach might want to take
that dummy address as an argument to the intrinsic.
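
As a toy illustration of why the dummy address would surface in the
interface, here is a single-threaded C++ model. Everything in it is an
assumption made for the sketch: ToyHart and cmpxchg_eager_clear are
hypothetical names, and the model only captures the rule that any SC
drops the hart's reservation whether or not it succeeds. It is not real
hardware behaviour in full, and not a proposed intrinsic signature.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of one hart's LR/SC reservation (illustration only).
struct ToyHart {
    const uint32_t* reservation = nullptr;

    // Load-reserved: remember the address and return its contents.
    uint32_t lr(const uint32_t& addr) {
        reservation = &addr;
        return addr;
    }

    // Store-conditional: succeeds only if the reservation matches, and
    // always drops the reservation afterwards.
    bool sc(uint32_t& addr, uint32_t value) {
        bool ok = (reservation == &addr);
        reservation = nullptr;
        if (ok) addr = value;
        return ok;
    }
};

// Compare-and-exchange that eagerly clears the reservation on compare
// failure. Note the extra `dummy` parameter this forces on the caller.
inline uint32_t cmpxchg_eager_clear(ToyHart& hart, uint32_t& mem,
                                    uint32_t expected, uint32_t desired,
                                    uint32_t& dummy) {
    assert(&mem != &dummy);  // the dummy must be a distinct location
    for (;;) {
        uint32_t observed = hart.lr(mem);
        if (observed != expected) {
            // This SC cannot succeed (the reservation is on `mem`), so
            // `dummy` is never written; it only clears the reservation.
            hart.sc(dummy, 0);
            return observed;
        }
        if (hart.sc(mem, desired))
            return observed;
    }
}
```

In this model a failed compare leaves both mem and dummy unchanged and
the reservation cleared, at the cost of one more pointer argument.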

Best,

Alex