[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Fri Jun 15 15:03:37 PDT 2018

On Wed, Jun 13, 2018 at 11:42 AM Alex Bradbury <asb at lowrisc.org> wrote:

> ## Proposed lowering strategy (RISC-V)
>
> Basic principles:
> * The LL/SC loop should be treated as a black box, and expanded post-RA.
>

It will probably come as no surprise, but I think this is 100% required for
correctness. I'm quite skeptical that it can be correct on any existing
architecture, without doing this. My first inclination upon hearing that
some architecture may not require such careful handling is that their
architecture documentation probably failed to document the careful handling
required, not that it's not required.

However, in order to not block this on every architecture maintainer being
persuaded, it's a good idea to introduce the new functionality into
AtomicExpandPass, and switch architectures over to it as the arch
maintainers are convinced it's a good idea.

> * If code can be safely expanded in the IR, do it there. [I'd welcome
> feedback
> on this one - should I be taking a closer look at expansion in SelectionDAG
> legalisation?

Definitely in favor of IR expansion -- we should do as much in IR as is
feasible, because it admits better optimization opportunities. Only the
LL/SC-loop should be late-expanded.

> The above can be achieved by extending AtomicExpandPass to support a
> 'Custom'

expansion method, which uses a TargetLowering call to expand to custom IR,
> including an appropriate intrinsic representing the LL+SC loop

I think it'd be better to sink more of the functionality into
AtomicExpandPass itself (rather than adding a "Custom" hook). However, that
ties into whether to introduce a common intrinsic that can be used across
architectures...

> * Part-word AMO without a native instruction that can be implemented by a
> native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be
> implemented
> by 32-bit amoand, amoor, amoxor. Perform this conversion as an IR
> transformation.
>
+1, and that transform should be in common code in AtomicExpandPass.

> * Word-sized compared-and-exchange: Lower to a pseudo-instruction using the
> normal instruction selection mechanism. This pseudo-instruction will be
> expanded after register allocation.
>

On RISCV, implementing the whole thing in the pseudo is probably right,
since you only really have the inner-loop.

But for other archs like ARMv7, I think it'll probably makes sense to
continue to handle a bunch of the cmpxchg expansion in IR. There, the
current cmpxchg expansion can be quite complex, but only loop really needs
to be a primitive (we'd need two loop variants, both "strex, ldrex, loop"
or "ldrex, strex, loop", depending on whether it generates an initial
ldrex+barrier first). All the rest -- initial ldrex+barrier, clrex,
barriers--  can all remain IR-level expanded.

I'll note that in all cases, both for RISCV and ARM and others, we _really_
would like to be able to have the compare-failure jump to a different
address than success. That isn't possible for an intrinsic call at the
moment, but I believe it will be possible to make that work soon, due to
already ongoing work for "asm goto", which requires similar. Once we have
that ability, I don't see any reason why the late-lowering cmpxchg pseudo
should have any performance downside vs IR expansion, at least w.r.t. any
correct optimizations.

* Part-word compare-and-exchange: Handled similarly to part-word AMOs,
> calling
> llvm.riscv.masked.cmpxchg.i32.
>

Yep.

Note that there are additional complications with cmpxchg such as supporting
> weak cmpxchg (which requires returning a success value), or supporting
> different failure orderings. It looks like the differentiation between
> strong/weak cmpxchg doesn't survive the translation to SelectionDAG right
> now.
> Supporting only strong cmpxchg and using the success ordering for the
> failure
> case is conservative but correct I believe.
>

Yep. The distinction between strong/weak is only relevant to LL/SC
architectures, where you get to avoid making a loop. AFAIK, all ISAs which
have a native cmpxchg instruction implement the strong semantics for it,
and since the LL/SC loops are expanded in IR right now, they didn't need
the weak indicator in SelectionDAG. While strong doesn't _need_ a success
indicator output, as noted above it'll generate nicer code when we can.

In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
> latest possible moment. The RISCVExpandPseudoInsts pass is registered with
> addPreEmitPass2.
>
> The main aspect I'm unhappy with in this approach is the need to introduce
> new
> intrinsics. Ideally these would be documented as not for use by frontends
> and
> subject to future removal or alteration - is there precedent for this?
> Alternatively, see the suggestion below to introduce target-independent
> masked
> AMO intrinsics.

I agree -- it'd be great to somehow name/annotate these intrinsics, such
that it's clear that they're a private implementation detail _INSIDE_ the
llvm target codegen passes, with no stability guaranteed. Not even "opt"
passes should be emitting them, so they should never end up in a bitcode
file where we'd need to provide backwards compatibility.

Maybe we can call them something like
"llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where
the random number in the middle is something that changes), and document
that nobody must use intrinsics in the INTERNAL_CODEGEN, other than llvm
CodeGen.

## Alternative options

[A bunch of alternatives I don't like, so i'm omitting them]

> 5. Introduce target-independent intrinsics for masked atomic operations.
> This
>
seems worthy of consideration.
>

I think it's definitely worthwhile looking to see if the set of intrinsics
for the atomicrmw operations (in particular, the set of additional
arguments computed in the pre-loop section, and the return value) are
likely going to be the same on the different architectures we have now. If
so, I think it's definitely worthwhile making a commonly-named intrinsic.
However, even so, it should remain for internal use only, and only
implemented on targets which tell AtomicExpandPass to generate it. I'd like
it to be common only to enhance code-reuse among the targets, not to
provide a user-facing API.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180615/01aca6bf/attachment.html>