[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Alex Bradbury via llvm-dev llvm-dev at lists.llvm.org
Wed Jun 13 08:42:47 PDT 2018

# RFC: Atomic LL/SC loops in LLVM revisited

## Summary

This proposal gives a brief overview of the challenges of lowering to LL/SC
loops and details the approach I am taking for RISC-V. Beyond getting feedback
on that work, my intention is to find consensus on moving other backends
towards a similar approach and sharing common code where feasible. Scroll down
to 'Questions' for a summary of the issues I think need feedback and
discussion.

For the original discussion of LL/SC lowering, please refer to James
Knight's 2016 thread on the topic
<http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html>.

I'd like to thank James Knight, JF Bastien, and Eli Friedman for being so
generous with their review feedback on this atomics work so far.

## Background: Atomics in LLVM

See the documentation for full details <https://llvm.org/docs/Atomics.html>.
In short: LLVM defines memory ordering constraints matching the C11/C++11
memory model (unordered, monotonic, acquire, release, acq_rel, seq_cst).
These can be given as parameters to the atomic operations supported in LLVM:
* Fences via the fence instruction.
* Atomic load and store via the 'load atomic' and 'store atomic' variants of
the load/store instructions.
* Fetch-and-binop / read-modify-write operations through the atomicrmw
instruction.
* Compare and exchange via the cmpxchg instruction. It takes memory orderings
for both the success and failure cases, and can be marked 'weak' or 'strong',
where the weak variant is allowed to fail spuriously.
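
For concreteness, the operations above look roughly like this in (2018-era,
typed-pointer) LLVM IR; the function and value names are illustrative:

```llvm
define i32 @example(i32* %ptr, i32 %cmp, i32 %new) {
  fence acquire
  ; atomic load/store with explicit orderings
  %v = load atomic i32, i32* %ptr monotonic, align 4
  store atomic i32 %v, i32* %ptr release, align 4
  ; read-modify-write: atomically add 1, returning the old value
  %old = atomicrmw add i32* %ptr, i32 1 acq_rel
  ; weak cmpxchg with separate success/failure orderings;
  ; returns the loaded value and a success flag
  %pair = cmpxchg weak i32* %ptr, i32 %cmp, i32 %new seq_cst monotonic
  %res = extractvalue { i32, i1 } %pair, 0
  ret i32 %res
}
```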

## Background: Atomics in RISC-V

For full details see a recent draft of the ISA manual
which incorporates work from the Memory Consistency Model Task Group to define
the memory model. RISC-V implements a weak memory model.

For those not familiar, RISC-V is a modular ISA, with standard extensions
indicated by single letters. Baseline 'RV32I' or 'RV64I' instruction sets
don't support atomic operations beyond fences. However, the RV32A and RV64A
instruction set extensions introduce AMOs (Atomic Memory Operations) and LR/SC
(load-reserved/store-conditional, known as load-linked/store-conditional on
other architectures). 32-bit atomic operations are supported natively on RV32,
and both 32 and 64-bit atomic operations are supported natively on RV64.

AMOs such as 'amoadd.w' implement simple fetch-and-binop behaviour. For
LR/SC: LR loads a word and registers a reservation on the source memory address.
SC writes the given word to the memory address and writes success (zero) or
failure (non-zero) into the destination register. LR/SC can be used to
implement compare-and-exchange or to implement AMOs that don't have a native
instruction. To do so, you would typically perform LR and SC in a loop.
However, there are strict limits on the instructions that can be placed
between a LR and an SC while still guaranteeing forward progress:

The static code for the LR/SC sequence plus the code to retry the sequence in
case of failure must comprise at most 16 integer instructions placed
sequentially in memory. For the sequence to be guaranteed to eventually
succeed, the dynamic code executed between the LR and SC instructions can only
contain other instructions from the base ā€œIā€ subset, excluding loads, stores,
backward jumps or taken backward branches, FENCE, FENCE.I, and SYSTEM
instructions. The code to retry a failing LR/SC sequence can contain backward
jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same
constraints.
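
As a concrete illustration, a 32-bit 'atomicrmw add' expanded to an LR/SC loop
might look like the following sketch (register assignments and the .aq/.rl
ordering suffixes are illustrative and depend on the requested ordering):

```
retry:
  lr.w.aq   a2, (a0)        # load-reserved from address in a0
  add       a3, a2, a1      # compute old value + increment
  sc.w.rl   a4, a3, (a0)    # store-conditional; a4 = 0 on success
  bnez      a4, retry       # backward branch to retry loop on failure
# a2 now holds the value loaded before the update
```

Note the loop body stays within the forward-progress rules quoted above: only
base 'I' instructions, no loads/stores, and only the retry branch is backward.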

The native AMOs and LR/SC allow ordering constraints to be specified in the
instruction. This isn't possible for load/store instructions, so fences must
be inserted to represent the ordering constraints. 8 and 16-bit atomic
load/store are therefore supported using 8 and 16-bit load/store plus
appropriate fences.

See Table A.6 on page 187 in the linked specification for a mapping from C/C++
atomic constructs to RISC-V instructions.
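
For example, following the style of the Table A.6 mappings, an acquire load
and a release store of a byte are implemented with a plain access plus a
fence (a sketch; register choices are illustrative):

```
# 8-bit load with acquire semantics: plain byte load, then fence
lb    a1, 0(a0)
fence r, rw      # order the load before subsequent accesses

# 8-bit store with release semantics: fence, then plain byte store
fence rw, w      # order prior reads/writes before the store
sb    a1, 0(a0)
```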

## Background: Lowering atomics in LLVM

The AtomicExpandPass can help you support atomics for your target in a number
of ways. e.g. inserting fences around atomic loads/stores, or converting an
atomicrmw/cmpxchg to a LL/SC loop. It operates as an IR-level pass, meaning
the latter ability is problematic - there is no way to guarantee that the
invariants for the LL/SC loop required by the target architecture will be
maintained. This shows up most frequently when register spills are introduced
at O0, but spills could theoretically still occur at higher optimisation
levels and there are other potential sources of issues: inappropriate
instruction selection, machine block placement, machine outlining (though see
D47654 and D47655), and likely more.

I highly encourage you to read James Knight's previous post on this topic
which goes into much more detail about the issues with handling LL/SC
<http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html>. The situation
remains pretty much the same:

* ARM and AArch64 expand to LL/SC loops in IR using AtomicExpandPass for O1
and above but use a custom post-regalloc expansion for O0
* MIPS doesn't use AtomicExpandPass, but selects atomic pseudoinstructions
which it expands to LL/SC loops in EmitInstrWithCustomInserter. This still has
the problems described above, so MIPS is in the process of moving towards a
two-stage lowering, with the LL/SC loop lowered after register allocation. See
D31287 <https://reviews.llvm.org/D31287>.
* Hexagon unconditionally expands to LL/SC loops in IR using AtomicExpandPass.

Lowering a word-sized atomic operation to an LL/SC loop is typically trivial,
requiring little surrounding code. Part-word atomics require additional
shifting and masking as a word-size access is used. It would be beneficial if
the code to generate this shifting and masking could be shared between
targets, and if the operations that don't need to be in the LL/SC loop are
exposed for LLVM optimisation.

The path forwards is very clearly to treat the LL/SC loop as an indivisible
operation which is expanded as late as possible (and certainly after register
allocation). However, there are a few ways of achieving this.

If atomic operations of a given size aren't supported, then calls should be
created to the helper functions in libatomic, and this should be done
consistently for all atomic operations of that size. I actually found GCC is
buggy in that respect <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005>.

## Proposed lowering strategy (RISC-V)

Basic principles:
* The LL/SC loop should be treated as a black box, and expanded post-RA.
* Don't introduce intrinsics that simply duplicate existing IR instructions.
* If code can be safely expanded in the IR, do it there. [I'd welcome feedback
on this one - should I be taking a closer look at expansion in SelectionDAG?]
The above can be achieved by extending AtomicExpandPass to support a 'Custom'
expansion method, which uses a TargetLowering call to expand to custom IR,
including an appropriate intrinsic representing the LL+SC loop.

Atomic operations are lowered in the following ways:

* Atomic load/store: Allow AtomicExpandPass to insert appropriate fences
* Word-sized AMO supported by a native instruction: Leave the IR unchanged and
use the normal instruction selection mechanism
* Word-sized AMO without a native instruction: Select a pseudo-instruction
using the normal instruction selection mechanism. This pseudo-instruction will
be expanded after register allocation.
* Part-word AMO without a native instruction: Shifting and masking that occurs
outside of the LL/SC loop is expanded in the IR, and a call to a
target-specific intrinsic to implement the LL/SC loop is inserted (e.g.
llvm.riscv.masked.atomicrmw.add.i32). The intrinsic is matched to a
pseudo-instruction which is expanded after register allocation.
* Part-word AMO without a native instruction that can be implemented by a
native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be implemented
by 32-bit amoand, amoor, amoxor. Perform this conversion as an IR
transformation.
* Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
normal instruction selection mechanism. This pseudo-instruction will be
expanded after register allocation.
* Part-word compare-and-exchange: Handled similarly to part-word AMOs, calling
a target-specific intrinsic to implement the LL/SC loop.
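
For the part-word AMO case above, the IR-level expansion might look roughly
like the following sketch for 'atomicrmw add i8' on RV32. The intrinsic's
exact signature (in particular the trailing ordering operand) is an assumption
here; only the LL/SC loop itself is hidden inside the intrinsic, leaving the
address alignment, masking, and shifting visible to IR optimisations:

```llvm
define i8 @atomic_add_i8(i8* %p, i8 %v) {
  %addr = ptrtoint i8* %p to i32
  %aligned = and i32 %addr, -4        ; word-align the address
  %wordp = inttoptr i32 %aligned to i32*
  %byteoff = and i32 %addr, 3         ; byte offset within the word
  %shamt = shl i32 %byteoff, 3        ; shift amount in bits
  %mask = shl i32 255, %shamt         ; mask selecting the byte lane
  %vext = zext i8 %v to i32
  %vshift = shl i32 %vext, %shamt     ; operand shifted into position
  ; LL/SC loop encapsulated in the target intrinsic (assumed signature)
  %old = call i32 @llvm.riscv.masked.atomicrmw.add.i32(
             i32* %wordp, i32 %vshift, i32 %mask, i32 0)
  %oldshift = lshr i32 %old, %shamt
  %res = trunc i32 %oldshift to i8    ; extract the old byte value
  ret i8 %res
}
declare i32 @llvm.riscv.masked.atomicrmw.add.i32(i32*, i32, i32, i32)
```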

Scratch registers for these pseudo-instructions are modelled as in ARM and
AArch64, by specifying multiple outputs and specifying an @earlyclobber
constraint to ensure the register allocator assigns unique registers. e.g.:

class PseudoCmpXchg
    : Pseudo<(outs GPR:$res, GPR:$scratch),
             (ins GPR:$addr, GPR:$cmpval, GPR:$newval, i32imm:$ordering), []> {
  let Constraints = "@earlyclobber $res,@earlyclobber $scratch";
  let mayLoad = 1;
  let mayStore = 1;
  let hasSideEffects = 0;
}

Note that there are additional complications with cmpxchg such as supporting
weak cmpxchg (which requires returning a success value), or supporting
different failure orderings. It looks like the differentiation between
strong/weak cmpxchg doesn't survive the translation to SelectionDAG right now.
Supporting only strong cmpxchg and using the success ordering for the failure
case is, I believe, conservative but correct.
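
The post-RA expansion of such a pseudo-instruction might look like the
following sketch (register assignments and ordering suffixes illustrative),
with $res and $scratch guaranteed distinct from the inputs by the
@earlyclobber constraints:

```
# a0 = addr, a1 = cmpval, a2 = newval, a3 = $res, a4 = $scratch
loop:
  lr.w.aqrl a3, (a0)        # $res = loaded value
  bne       a3, a1, done    # comparison failed: exit, a3 holds observed value
  sc.w.aqrl a4, a2, (a0)    # try to store newval; $scratch = 0 on success
  bnez      a4, loop        # retry on SC failure
done:
```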

In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
latest possible moment. The RISCVExpandPseudoInsts pass is registered to run
after register allocation, as late as possible in the codegen pipeline.

The main aspect I'm unhappy with in this approach is the need to introduce new
intrinsics. Ideally these would be documented as not for use by frontends and
subject to future removal or alteration - is there precedent for this?
Alternatively, see the suggestion below to introduce target-independent masked
AMO intrinsics.

## Alternative options

1. Don't expand anything in IR, and lower to a single monolithic
pseudo-instruction that is expanded at the last minute.
2. Don't expand anything in IR, and lower to pseudo-instructions in stages.
Lower to a monolithic pseudo-instruction where any logic outside of the LL/SC
loop is expanded in EmitInstrWithCustomInserter but the LL/SC loop is
represented by a new pseudoinstruction. This final pseudoinstruction is then
expanded after register allocation. This minimises the possibility for sharing
logic between backends, but does mean we don't need to expose new intrinsics.
Mips adopts this approach in D31287.
3. Target-independent SelectionDAG expansion code converts unsupported atomic
operations. e.g. rather than converting `atomicrmw add i8` to AtomicLoadAdd,
expand to nodes that align the address and calculate the mask as well as an
AtomicLoadAddMasked node. I haven't looked at this in great detail.
4. Introducing masked atomic operations to the IR. Mentioned for completeness,
I don't think anyone wants this.
5. Introduce target-independent intrinsics for masked atomic operations. This
seems worthy of consideration.

For 1. and 2. the possibility for sharing logic between backends is minimised
and the address calculation, masking and shifting logic is mostly hidden from
optimisations (though option 2. allows e.g. MachineCSE). There is the
advantage of avoiding the need for new intrinsics.

## Patches up for review

I have patches up for review which implement the described strategy. More
could be done to increase the potential for code reuse across targets, but I
thought it would be worth getting feedback on the path forwards first.

* D47587: [RISCV] Codegen support for atomic operations on RV32I.
<https://reviews.llvm.org/D47587>. Simply adds support for lowering fences and
uses AtomicExpandPass to generate libatomic calls otherwise. Committed.
* D47589: [RISCV] Add codegen support for atomic load/stores with RV32A.
<https://reviews.llvm.org/D47589>. Use AtomicExpandPass to insert fences for
atomic load/store. Committed in rL334591.
* D47882: [RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A.
<https://reviews.llvm.org/D47882>. Implements the lowering strategy described
above for atomicrmw and adds a hook to allow custom atomicrmw expansion in IR.
Under review.
* D48129: [RISCV] Improved lowering for bit-wise atomicrmw {i8, i16} on RV32A.
<https://reviews.llvm.org/D48129>. Uses 32-bit AMO{AND,OR,XOR} with
appropriately manipulated operands to implement 8/16-bit AMOs. Under review.
* D48130: [AtomicExpandPass]: Add a hook for custom cmpxchg expansion in IR.
<https://reviews.llvm.org/D48130> Separated patch as this modifies the
existing shouldExpandAtomicCmpXchgInIR interface. Under review.
* D48131: [RISCV] Implement codegen for cmpxchg on RV32I.
<https://reviews.llvm.org/D48131> Implements the lowering strategy described
above. Under review.

## Questions

To pick a few to get started:

* How do you feel about the described lowering strategy? Am I unfairly
overlooking a SelectionDAG approach?
* How much enthusiasm is there for moving ARM, AArch64, Mips, Hexagon, and
other architectures to use such an approach?
  * If there is enthusiasm, how worthwhile is it to share logic for generation
  of masks+shifts needed for part-word atomics?
  * I'd like to see ARM+AArch64+Hexagon move away from the problematic
  expansion in IR and to have that code deleted from AtomicExpandPass. Are
  there any objections?
* What are your thoughts on the introduction of new target-independent
intrinsics for masked atomics?

Many thanks for your feedback,

Alex Bradbury, lowRISC CIC
