[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Krzysztof Parzyszek via llvm-dev llvm-dev at lists.llvm.org
Fri Jun 15 08:21:24 PDT 2018


Hexagon has very few restrictions on what will cause loss of a 
reservation: stores to the same address (a 64-bit granule) or any sort 
of exception/interrupt/system call. Other than that, the reservation 
should stay. The architecture doesn't explicitly guarantee this, but in 
the absence of the elements listed above, a program with LL/SC can be 
expected to make progress.

Consequently, the best way for us to handle LL/SC would be to expand 
them early and let them be optimized like any other code. The usual 
optimization restrictions should be sufficient to prevent introduction 
of factors causing a loss of reservation.

With the constraints on LL/SC varying wildly between architectures, 
maybe we should have several options available for different targets?

-Krzysztof

On 6/13/2018 10:42 AM, Alex Bradbury wrote:
> # RFC: Atomic LL/SC loops in LLVM revisited
> 
> ## Summary
> 
> This proposal gives a brief overview of the challenges of lowering to LL/SC
> loops and details the approach I am taking for RISC-V. Beyond getting feedback
> on that work, my intention is to find consensus on moving other backends
> towards a similar approach and sharing common code where feasible. Scroll down
> to 'Questions' for a summary of the issues I think need feedback and
> agreement.
> 
> For the original discussion of LL/SC lowering, please refer to James
> Knight's 2016 thread on the topic:
> http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html
> 
> I'd like to thank James Knight, JF Bastien, and Eli Friedman for being so
> generous with their review feedback on this atomics work so far.
> 
> ## Background: Atomics in LLVM
> 
> See the documentation for full details <https://llvm.org/docs/Atomics.html>.
> In short: LLVM defines memory ordering constraints to match the C11/C++11
> memory model (unordered, monotonic, acquire, release, acqrel, seqcst).
> These can be given as parameters to the atomic operations supported in LLVM
> IR:
> 
> * Fences with the fence instruction
> * Atomic load and store with the 'load atomic' and 'store atomic' variants of
> the load/store instructions.
> * Fetch-and-binop / read-modify-write operations through the atomicrmw
> instruction.
> * Compare and exchange via the cmpxchg instruction. Takes memory ordering for
> both success and failure cases. Can also specify a 'weak' vs 'strong' cmpxchg,
> where the weak variant allows spurious failure. A short IR fragment
> illustrating these operations follows.
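> 
> As a quick illustration (a sketch only; names and types are just examples),
> these operations look like this in IR:
> 
>      fence seq_cst
>      %v = load atomic i32, i32* %p acquire, align 4
>      store atomic i32 %v, i32* %q release, align 4
>      %old = atomicrmw add i32* %p, i32 1 monotonic
>      %pair = cmpxchg i32* %p, i32 %old, i32 %v acq_rel monotonic  ; { i32, i1 }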
> 
> ## Background: Atomics in RISC-V
> 
> For full details see a recent draft of the ISA manual
> <https://github.com/riscv/riscv-isa-manual/releases/download/draft-20180612-548fd40/riscv-spec.pdf>,
> which incorporates work from the Memory Consistency Model Task Group to define
> the memory model. RISC-V implements a weak memory model.
> 
> For those not familiar, RISC-V is a modular ISA, with standard extensions
> indicated by single letters. Baseline 'RV32I' or 'RV64I' instruction sets
> don't support atomic operations beyond fences. However the RV32A and RV64A
> instruction set extensions introduce AMOs (Atomic Memory Operations) and LR/SC
> (load-linked/store-conditional on other architectures). 32-bit atomic
> operations are supported natively on RV32, and both 32- and 64-bit atomic
> operations are supported natively on RV64.
> 
> AMOs such as 'amoadd.w' implement simple fetch-and-binop behaviour. For
> LR/SC: LR loads a word and registers a reservation on the source memory address.
> SC writes the given word to the memory address and writes success (zero) or
> failure (non-zero) into the destination register. LR/SC can be used to
> implement compare-and-exchange or to implement AMOs that don't have a native
> instruction. To do so, you would typically perform LR and SC in a loop.
> However, there are strict limits on the instructions that can be placed
> between a LR and an SC while still guaranteeing forward progress:
> 
> """
> The static code for the LR/SC sequence plus the code to retry the sequence in
> case of failure must comprise at most 16 integer instructions placed
> sequentially in memory. For the sequence to be guaranteed to eventually
> succeed, the dynamic code executed between the LR and SC instructions can only
> contain other instructions from the base "I" subset, excluding loads, stores,
> backward jumps or taken backward branches, FENCE, FENCE.I, and SYSTEM
> instructions. The code to retry a failing LR/SC sequence can contain backward
> jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same
> constraints.
> """
> 
> The native AMOs and LR/SC allow ordering constraints to be specified in the
> instruction. This isn't possible for load/store instructions, so fences must
> be inserted to represent the ordering constraints. 8 and 16-bit atomic
> load/store are therefore supported using 8 and 16-bit load/store plus
> appropriate fences.
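> 
> In IR terms, fence insertion (as done by AtomicExpandPass, discussed below)
> conceptually turns an acquire load or release store into a monotonic access
> bracketed by fences, roughly:
> 
>      ; acquire load
>      %v = load atomic i8, i8* %p monotonic, align 1
>      fence acquire
>      ; release store
>      fence release
>      store atomic i8 %v, i8* %q monotonic, align 1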
> 
> See Table A.6 on page 187 in the linked specification for a mapping from C/C++
> atomic constructs to RISC-V instructions.
> 
> ## Background: Lowering atomics in LLVM
> 
> The AtomicExpandPass can help you support atomics for your target in a number
> of ways. e.g. inserting fences around atomic loads/stores, or converting an
> atomicrmw/cmpxchg to a LL/SC loop. It operates as an IR-level pass, meaning
> the latter ability is problematic - there is no way to guarantee that the
> invariants for the LL/SC loop required by the target architecture will be
> maintained. This shows up most frequently when register spills are introduced
> at O0, but spills could theoretically still occur at higher optimisation
> levels and there are other potential sources of issues: inappropriate
> instruction selection, machine block placement, machine outlining (though see
> D47654 and D47655), and likely more.
> 
> I highly encourage you to read James Knight's previous post on this topic,
> which goes into much more detail about the issues with handling LL/SC
> <http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html>. The situation
> remains pretty much the same:
> 
> * ARM and AArch64 expand to LL/SC loops in IR using AtomicExpandPass for O1
> and above but use a custom post-regalloc expansion for O0
> * MIPS doesn't use AtomicExpandPass, but selects atomic pseudoinstructions
> which it expands to LL/SC loops in EmitInstrWithCustomInserter. This still has
> the problems described above, so MIPS is in the process of moving towards a
> two-stage lowering, with the LL/SC loop lowered after register allocation. See
> D31287 <https://reviews.llvm.org/D31287>.
> * Hexagon unconditionally expands to LL/SC loops in IR using AtomicExpandPass.
> 
> Lowering a word-sized atomic operation to an LL/SC loop is typically trivial,
> requiring little surrounding code. Part-word atomics require additional
> shifting and masking as a word-size access is used. It would be beneficial if
> the code to generate this shifting and masking could be shared between
> targets, and if the operations that don't need to be in the LL/SC loop are
> exposed for LLVM optimisation.
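> 
> To make that concrete, here is a rough sketch (RV32, little-endian; the value
> names are illustrative, not taken from the patches) of the address and mask
> computation that can live outside the LL/SC loop for an i8 atomicrmw:
> 
>      %addr    = ptrtoint i8* %ptr to i32
>      %aligned = and i32 %addr, -4              ; containing aligned word
>      %wordptr = inttoptr i32 %aligned to i32*
>      %byteoff = and i32 %addr, 3
>      %shamt   = shl i32 %byteoff, 3            ; bit offset of the byte
>      %mask    = shl i32 255, %shamt            ; selects the byte in the word
>      %valext  = zext i8 %val to i32
>      %newval  = shl i32 %valext, %shamt        ; operand shifted into position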
> 
> The path forwards is very clearly to treat the LL/SC loop as an indivisible
> operation which is expanded as late as possible (and certainly after register
> allocation). However, there are a few ways of achieving this.
> 
> If atomic operations of a given size aren't supported, then calls should be
> created to the helper functions in libatomic, and this should be done
> consistently for all atomic operations of that size. I actually found GCC is
> buggy in that respect <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005>.
> 
> ## Proposed lowering strategy (RISC-V)
> 
> Basic principles:
> * The LL/SC loop should be treated as a black box, and expanded post-RA.
> * Don't introduce intrinsics that simply duplicate existing IR instructions
> * If code can be safely expanded in the IR, do it there. [I'd welcome feedback
> on this one - should I be taking a closer look at expansion in SelectionDAG
> legalisation?]
> 
> The above can be achieved by extending AtomicExpandPass to support a 'Custom'
> expansion method, which uses a TargetLowering call to expand to custom IR,
> including an appropriate intrinsic representing the LL+SC loop.
> 
> Atomic operations are lowered in the following ways:
> 
> * Atomic load/store: Allow AtomicExpandPass to insert appropriate fences
> * Word-sized AMO supported by a native instruction: Leave the IR unchanged and
> use the normal instruction selection mechanism
> * Word-sized AMO without a native instruction: Select a pseudo-instruction
> using the normal instruction selection mechanism. This pseudo-instruction will
> be expanded after register allocation.
> * Part-word AMO without a native instruction: Shifting and masking that occurs
> outside of the LL/SC loop is expanded in the IR, and a call to a
> target-specific intrinsic to implement the LL/SC loop is inserted (e.g.
> llvm.riscv.masked.atomicrmw.add.i32; a sketch follows this list). The intrinsic
> is matched to a pseudo-instruction which is expanded after register allocation.
> * Part-word AMO without a native instruction that can be implemented by a
> native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be implemented
> by 32-bit amoand, amoor, amoxor. Perform this conversion as an IR
> transformation.
> * Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
> normal instruction selection mechanism. This pseudo-instruction will be
> expanded after register allocation.
> * Part-word compare-and-exchange: Handled similarly to part-word AMOs, calling
> llvm.riscv.masked.cmpxchg.i32.
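> 
> For the part-word atomicrmw case, the IR left behind might look roughly like
> the following (continuing the earlier mask/shift sketch; treat the exact
> intrinsic signature as illustrative rather than final):
> 
>      ; %ordering is the memory ordering encoded as an integer constant
>      %wide   = call i32 @llvm.riscv.masked.atomicrmw.add.i32(i32* %wordptr, i32 %newval, i32 %mask, i32 %ordering)
>      %shift  = lshr i32 %wide, %shamt          ; shift the old word back down
>      %result = trunc i32 %shift to i8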
> 
> Scratch registers for these pseudo-instructions are modelled as in ARM and
> AArch64, by specifying multiple outputs and specifying an @earlyclobber
> constraint to ensure the register allocator assigns unique registers. e.g.:
> 
> class PseudoCmpXchg
>      : Pseudo<(outs GPR:$res, GPR:$scratch),
>               (ins GPR:$addr, GPR:$cmpval, GPR:$newval, i32imm:$ordering), []> {
>    let Constraints = "@earlyclobber $res,@earlyclobber $scratch";
>    let mayLoad = 1;
>    let mayStore = 1;
>    let hasSideEffects = 0;
> }
> 
> Note that there are additional complications with cmpxchg such as supporting
> weak cmpxchg (which requires returning a success value), or supporting
> different failure orderings. It looks like the differentiation between
> strong/weak cmpxchg doesn't survive the translation to SelectionDAG right now.
> Supporting only strong cmpxchg and using the success ordering for the failure
> case is conservative but correct, I believe.
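> 
> For reference, the success value is just the second element of the pair that
> cmpxchg returns in IR:
> 
>      %res = cmpxchg weak i32* %p, i32 %cmp, i32 %new seq_cst monotonic
>      %old = extractvalue { i32, i1 } %res, 0
>      %ok  = extractvalue { i32, i1 } %res, 1   ; success flag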
> 
> In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
> latest possible moment. The RISCVExpandPseudoInsts pass is registered with
> addPreEmitPass2.
> 
> The main aspect I'm unhappy with in this approach is the need to introduce new
> intrinsics. Ideally these would be documented as not for use by frontends and
> subject to future removal or alteration - is there precedent for this?
> Alternatively, see the suggestion below to introduce target-independent masked
> AMO intrinsics.
> 
> ## Alternative options
> 
> 1. Don't expand anything in IR, and lower to a single monolithic
> pseudo-instruction that is expanded at the last minute.
> 2. Don't expand anything in IR, and lower to pseudo-instructions in stages.
> Lower to a monolithic pseudo-instruction where any logic outside of the LL/SC
> loop is expanded in EmitInstrWithCustomInserter but the LL/SC loop is
> represented by a new pseudoinstruction. This final pseudoinstruction is then
> expanded after register allocation. This minimises the possibility for sharing
> logic between backends, but does mean we don't need to expose new intrinsics.
> Mips adopts this approach in D31287.
> 3. Target-independent SelectionDAG expansion code converts unsupported atomic
> operations. e.g. rather than converting `atomicrmw add i8` to AtomicLoadAdd,
> expand to nodes that align the address and calculate the mask as well as an
> AtomicLoadAddMasked node. I haven't looked at this in great detail.
> 4. Introducing masked atomic operations to the IR. Mentioned for completeness,
> I don't think anyone wants this.
> 5. Introduce target-independent intrinsics for masked atomic operations. This
> seems worthy of consideration.
> 
> For 1. and 2. the possibility for sharing logic between backends is minimised
> and the address calculation, masking and shifting logic is mostly hidden from
> optimisations (though option 2. allows e.g. MachineCSE). There is the
> advantage of avoiding the need for new intrinsics.
> 
> ## Patches up for review
> 
> I have patches up for review which implement the described strategy. More
> could be done to increase the potential for code reuse across targets, but I
> thought it would be worth getting feedback on the path forwards first.
> 
> * D47587: [RISCV] Codegen support for atomic operations on RV32I.
> <https://reviews.llvm.org/D47587>. Simply adds support for lowering fences and
> uses AtomicExpandPass to generate libatomic calls otherwise. Committed in
> rL334590.
> * D47589: [RISCV] Add codegen support for atomic load/stores with RV32A.
> <https://reviews.llvm.org/D47589>. Use AtomicExpandPass to insert fences for
> atomic load/store. Committed in rL334591.
> * D47882: [RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A.
> <https://reviews.llvm.org/D47882>. Implements the lowering strategy described
> above for atomicrmw and adds a hook to allow custom atomicrmw expansion in IR.
> Under review.
> * D48129: [RISCV] Improved lowering for bit-wise atomicrmw {i8, i16} on RV32A.
> <https://reviews.llvm.org/D48129>. Uses 32-bit AMO{AND,OR,XOR} with
> appropriately manipulated operands to implement 8/16-bit AMOs. Under review.
> * D48130: [AtomicExpandPass]: Add a hook for custom cmpxchg expansion in IR.
> <https://reviews.llvm.org/D48130> Separated patch as this modifies the
> existing shouldExpandAtomicCmpXchgInIR interface. Under review.
> * D48141: [RISCV] Implement codegen for cmpxchg on RV32I.
> <https://reviews.llvm.org/D48131> Implements the lowering strategy described
> above. Under review.
> 
> ## Questions
> 
> To pick a few to get started:
> 
> * How do you feel about the described lowering strategy? Am I unfairly
> overlooking a SelectionDAG approach?
> * How much enthusiasm is there for moving ARM, AArch64, Mips, Hexagon, and
> other architectures to use such an approach?
>    * If there is enthusiasm, how worthwhile is it to share logic for generation
>    of masks+shifts needed for part-word atomics?
>    * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>    expansion in IR and to have that code deleted from AtomicExpandPass. Are
>    there any objections?
> * What are your thoughts on the introduction of new target-independent
> intrinsics for masked atomics?
> 
> Many thanks for your feedback,
> 
> Alex Bradbury, lowRISC CIC
> 

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation

