[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Philip Reames
listmail at philipreames.com
Thu May 29 09:03:02 PDT 2014
I have some reservations about this proposal. I don't have anything
particularly concrete, but the idea of supporting both LL/SC and
atomicrmw in the IR concerns me from a complexity perspective. If we
find that there is significant profit in supporting both, we can, but I
would like to see documentation of significant benefit before changing
the IR.
Tim, for those of us not directly involved, could you share a selection
of bugs or other background? I'd like to read through and try to get a
better sense for the problem you're trying to solve.
Philip
On 05/29/2014 04:55 AM, Tim Northover wrote:
> Hi,
>
> I've been looking at improving atomicrmw & cmpxchg code more,
> particularly on architectures using the load-linked/store-conditional
> model.
>
> The summary is that current expansion for cmpxchg seems to happen too
> late for LLVM to make meaningful use of the opportunities it provides.
> I'd like to move it earlier and express it in terms of a first-class
> pair of "load linked" and "store conditional" instructions.
>
> More details are attached and inlined below. Currently the only code I
> have is for one or two AtomicExpand peepholes, so any suggestions or
> alternatives would be very welcome. It's possible this is entirely the
> wrong approach.
>
> What do people think?
>
> Cheers.
>
> Tim.
>
> The Problem
> -----------
>
> We now expand atomic instructions to load-linked & store-conditional intrinsics
> just before ISel, but this expansion opens up more optimisation opportunities
> that LLVM *would* be able to exploit if it saw the control flow earlier.
>
> For example the return value of the C++11 and C11 compare_exchange operations is
> actually whether the exchange succeeded, which leads to some common idioms in
> Clang-produced IR.
>
> From "if(__c11_compare_exchange_strong(...))":
>     %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
>     %success = icmp eq i32 %loaded, %oldval
>     br i1 %success, label %true, label %false
> the control-flow here should be something like:
> loop:
>     %loaded = load linked i32* %addr seq_cst
>     %trystore = icmp eq i32 %loaded, %oldval
>     br i1 %trystore, label %store.cond, label %false
> store.cond:
>     %success = store conditional i32 %newval, i32* %addr seq_cst
>     br i1 %success, label %true, label %loop
>
> From "return __c11_compare_exchange_strong(...);":
>     %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
>     %success = icmp eq i32 %loaded, %oldval
>     %res = zext i1 %success to i32
>     ret i32 %res
> here, we'd want something like:
> loop:
>     %loaded = load linked i32* %addr seq_cst
>     %trystore = icmp eq i32 %loaded, %oldval
>     br i1 %trystore, label %store.cond, label %false
> store.cond:
>     %success = store conditional i32 %newval, i32* %addr seq_cst
>     br i1 %success, label %true, label %loop
> true:
>     ret i32 1
> false:
>     ret i32 0
>
> In both cases the compare is largely redundant on LL/SC architectures: the
> control-flow graph already tells you what its result will be, and if LLVM sees
> this early enough it can produce something close to optimal code.
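[For reference, the C11 source that gives rise to the cmpxchg idioms quoted above looks roughly like the sketch below. Function and variable names are illustrative, not taken from the thread; Clang lowers the `atomic_compare_exchange_strong` call to the cmpxchg-plus-icmp pattern shown.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative C11 source for the idioms above. The boolean result of
   atomic_compare_exchange_strong is what Clang reconstructs with the
   "icmp eq %loaded, %oldval" in the quoted IR. */
bool try_update(_Atomic int *addr, int oldval, int newval) {
    /* Note: oldval is passed by pointer and is overwritten with the
       observed value on failure, per the C11 API. */
    return atomic_compare_exchange_strong(addr, &oldval, newval);
}
```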
>
> Unfortunately, most of LLVM has finished by the time this expansion occurs, so
> we end up with redundant compares.
>
> Now, it's certainly possible to put various peephole optimisations into the
> expansion pass to cover these cases (at least), but to me that feels like
> fragile duplication of effort we've already made in the mid-end.
>
> My Proposal
> -----------
>
> I think the best way to solve this longer term is to add explicit "load linked"
> and "store conditional" instructions to the LLVM IR, with equal status to
> "cmpxchg" and "atomicrmw".
>
> We could then simplify the AtomicExpandLoadLinked pass to use these, change it
> to use something like a TargetTransformInfo hook instead of TargetLowering and
> schedule it much earlier in the mid-end. X86 & SystemZ would simply request no
> expansion happen.
>
> This means Clang could still emit cmpxchg and atomicrmw, which closely match C &
> C++ semantics, but most of LLVM would see the control-flow and operations in a
> form much closer to how the targets will actually emit them.
>
> In addition, other languages would be able to provide more general atomic
> operations, if they want (slightly more complex bit twiddling seems like it
> could be particularly useful in some algorithms). Current atomic instructions
> cover (most of) C and C++, but nothing else.
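[As an illustration of the "slightly more complex bit twiddling" mentioned above, consider a hypothetical atomic "clear lowest set bit" operation. No atomicrmw subopcode covers it, so today a front-end must emit a compare-exchange loop like the C11 sketch below; with first-class LL/SC instructions it could be expressed directly as a load-linked, a bit operation, and a store-conditional. The function name is invented for illustration.]

```c
#include <stdatomic.h>

/* Hypothetical generalized RMW: atomically clear the lowest set bit of
   *addr and return the previous value. Portable fallback via CAS loop. */
int atomic_clear_lowest_set_bit(_Atomic int *addr) {
    int old = atomic_load_explicit(addr, memory_order_relaxed);
    /* On failure, compare_exchange_weak reloads the observed value into
       old, so the new value (old & (old - 1)) is recomputed each trip. */
    while (!atomic_compare_exchange_weak_explicit(
               addr, &old, old & (old - 1),
               memory_order_seq_cst, memory_order_relaxed))
        ;
    return old;
}
```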
>
> Potential issues
> ----------------
>
> 1. Can invalid transformations be done on them? Separating an LL from its
> corresponding SC by too much almost guarantees failure (an OS task switch
> happens, clearing the monitor).
> 2. We'd need to document their interaction with LLVM's memory model. I'm not
> sure I've seen a good happens-before model that explicitly uses LL/SC. The
> common ones people came up with in the Java & C++11 standardisation process
> seem to assume atomic RMW operations instead.
> 3. People might be assuming StoreInst produces no value and not calling
> ReplaceAllUses. A separate StoreConditionalInst is one option, but that risks
> the reverse, people not considering it a store when it is.
> 4. Not all targets can support them. There are very few LLVM instructions like
> that, which is a little concerning. But not all targets can support current
> atomics either.
>
> If we make vulnerability to ABA problems target-specific in our documentation
> then X86 & SystemZ could use the same trick as for min/max to implement them
> generically: a load/cmpxchg loop.
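[In C-level terms, the min/max trick referred to above is an ordinary load followed by a compare-exchange loop. A rough sketch of an atomic fetch-max built that way is below; the real expansion happens in LLVM IR rather than C, and the function name is illustrative.]

```c
#include <stdatomic.h>

/* Sketch of the load/cmpxchg-loop expansion: targets without LL/SC
   (X86, SystemZ) implement a generalized RMW by retrying a
   compare-exchange until no other thread has intervened. */
int atomic_fetch_max(_Atomic int *addr, int val) {
    int old = atomic_load_explicit(addr, memory_order_relaxed);
    while (!atomic_compare_exchange_weak(addr, &old,
                                         old > val ? old : val))
        ; /* old is refreshed with the observed value on each failure */
    return old;
}
```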
>
> Alternatives
> ------------
>
> 1. Enhancing MachineCSE and friends so they can optimise this.
> - Works for "%old = atomicrmw OP %addr, %inc; OP %old, %inc" already.
> - Cmpxchg is more difficult. It's the control-flow optimisations that are
> particularly powerful in these cases.
> - It's more duplicated effort between different parts of LLVM.
>
> 2. Peephole optimisations in the AtomicExpandLoadLinked pass.
> - Feels like a hack, spotting particular patterns that the mid-end can
> already cope with.
> - Deals with toy functions well, but it's not quite clear that those patterns
> are the only ones that appear in more general code (after inlining, say).
> - Potentially safest option: atomicrmw & cmpxchg preserved until
> very late. Nothing's going to come in and convert valid code to invalid.
>
> 3. Moving expansion further forward but still using intrinsics.
> - The intrinsics need to have fully general side effects to be correct:
> effectively an "asm volatile" for optimisation purposes, which is very
> heavy-handed for LLVM's other optimisations.
> - Still need target hooks to create the calls, because intrinsics don't get
> type lowered and so you can't write a fully generic one (e.g. an i64 ldrex
> on ARM needs to return { i32, i32 }).
>
> 4. Change the cmpxchg operation to return (say) { iN, i1 } where the second
> value indicates success.
> - Probably good idea anyway as part of support for weak compare-exchange
> operations.
> - Doesn't actually help this issue much: it's mostly control-flow
> optimisations that seem to be helpful, and this would leave us with phi
> nodes in need of materialisation when lowered.
> - Might make the peephole optimisations in 2 slightly simpler.
>
> 5. Some kind of generalisation that can cover both LL/SC and transactional
> memory to avoid doing this again if and when that works out?
> - I can't think of one that's both neat and practically implementable for
> both.
> - Suggestions welcome.
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev