[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Tim Northover
t.p.northover at gmail.com
Thu May 29 04:55:52 PDT 2014
Hi,
I've been looking at improving atomicrmw & cmpxchg code more,
particularly on architectures using the load-linked/store-conditional
model.
The summary is that the current expansion for cmpxchg seems to happen too
late for LLVM to make meaningful use of the opportunities it provides.
I'd like to move it earlier and express it in terms of a first-class
pair of "load linked" and "store conditional" instructions.
More details are attached and inlined below. Currently the only code I
have is for one or two AtomicExpand peepholes, so any suggestions or
alternatives would be very welcome. It's possible this is entirely the
wrong approach.
What do people think?
Cheers.
Tim.
The Problem
-----------
We now expand atomic instructions to load-linked & store-conditional intrinsics
just before ISel, but this expansion opens up more optimisation opportunities
that LLVM *would* be able to exploit if it saw the control flow earlier.
For example, the return value of the C++11 and C11 compare_exchange operations is
actually whether the exchange succeeded, which leads to some common idioms in
Clang-produced IR.
>From "if(__c11_compare_exchange_strong(...))":
    %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %loaded, %oldval
    br i1 %success, label %true, label %false
the control-flow here should be something like:
loop:
    %loaded = load linked i32* %addr seq_cst
    %trystore = icmp eq i32 %loaded, %oldval
    br i1 %trystore, label %store.cond, label %false
store.cond:
    %success = store conditional i32 %newval, i32* %addr seq_cst
    br i1 %success, label %true, label %loop
>From "return __c11_compare_exchange_strong(...);":
    %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %loaded, %oldval
    %res = zext i1 %success to i32
    ret i32 %res
here, we'd want something like:
loop:
    %loaded = load linked i32* %addr seq_cst
    %trystore = icmp eq i32 %loaded, %oldval
    br i1 %trystore, label %store.cond, label %false
store.cond:
    %success = store conditional i32 %newval, i32* %addr seq_cst
    br i1 %success, label %true, label %loop
true:
    ret i32 1
false:
    ret i32 0
In both cases the final compare is largely redundant on LL/SC architectures: its
result is already known from the control-flow graph, and if LLVM sees this early
enough it can produce something close to optimal code.
Unfortunately, most of LLVM has finished by the time this expansion occurs, so
we end up with redundant compares.
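As a concrete sketch (using ARM's existing ldrex/strex intrinsics; the exact IR
the current pass emits may differ in detail), the late expansion of the first
example leaves the user's compare in place even though every predecessor of its
block already determines the result:

loop:
    %loaded = call i32 @llvm.arm.ldrex.p0i32(i32* %addr)
    %should.store = icmp eq i32 %loaded, %oldval
    br i1 %should.store, label %try.store, label %done
try.store:
    %stat = call i32 @llvm.arm.strex.p0i32(i32 %newval, i32* %addr)
    %failed = icmp ne i32 %stat, 0
    br i1 %failed, label %loop, label %done
done:
    ; redundant: true when reached from try.store, false when reached from loop
    %success = icmp eq i32 %loaded, %oldval
    br i1 %success, label %true, label %false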
Now, it's certainly possible to put various peephole optimisations into the
expansion pass to cover these cases (at least), but to me that feels like
fragile duplication of work we've already done in the mid-end.
My Proposal
-----------
I think the best way to solve this longer term is to add explicit "load linked"
and "store conditional" instructions to the LLVM IR, with equal status to
"cmpxchg" and "atomicrmw".
We could then simplify the AtomicExpandLoadLinked pass to use these, change it
to query something like a TargetTransformInfo hook instead of TargetLowering,
and schedule it much earlier in the mid-end. X86 & SystemZ would simply request
that no expansion happen.
This means Clang could still emit cmpxchg and atomicrmw, which closely match C &
C++ semantics, but most of LLVM would see the control flow and operations in a
form much closer to how the targets will actually emit them.
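For instance, with the proposed instructions an "atomicrmw add" from Clang
(syntax illustrative, matching the examples above):

    %old = atomicrmw add i32* %addr, i32 %inc seq_cst

could be expanded early into a loop the whole mid-end can see:

loop:
    %loaded = load linked i32* %addr seq_cst
    %new = add i32 %loaded, %inc
    %success = store conditional i32 %new, i32* %addr seq_cst
    br i1 %success, label %done, label %loop
done:
    ; %loaded takes the place of %old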
In addition, other languages would be able to provide more general atomic
operations, if they want (slightly more complex bit twiddling seems like it
could be particularly useful in some algorithms). Current atomic instructions
cover (most of) C and C++, but nothing else.
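As a purely illustrative example, an atomic "clear the lowest set bit" has no
atomicrmw encoding today, but falls out naturally from the proposed
instructions:

loop:
    %loaded = load linked i32* %addr seq_cst
    %dec = sub i32 %loaded, 1
    %new = and i32 %loaded, %dec    ; x & (x-1) clears the lowest set bit
    %success = store conditional i32 %new, i32* %addr seq_cst
    br i1 %success, label %done, label %loop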
Potential issues
----------------
1. Can invalid transformations be done on them? Separating an LL from its
corresponding SC by too much code almost guarantees failure (e.g. an OS task
switch happens, clearing the monitor).
2. We'd need to document their interaction with LLVM's memory model. I'm not
sure I've seen a good happens-before model that explicitly uses LL/SC. The
common ones people came up with in the Java & C++11 standardisation process
seem to assume atomic RMW operations instead.
3. People might be assuming StoreInst produces no value and not calling
ReplaceAllUses. A separate StoreConditionalInst is one option, but that risks
the reverse: people not considering it a store when it is.
4. Not all targets can support them. There are very few LLVM instructions like
that, which is a little concerning. But not all targets can support current
atomics either.
If we make vulnerability to ABA problems target-specific in our documentation,
then X86 & SystemZ could use the same trick as for min/max to implement them
generically: a load/cmpxchg loop.
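That lowering might look something like this sketch (note the ABA exposure: the
cmpxchg also succeeds if the location was modified and then restored between
the two operations):

    ; load linked -> plain atomic load
    %loaded = load atomic i32* %addr seq_cst, align 4
    ; ... intervening operations ...
    ; store conditional -> cmpxchg against the previously loaded value
    %old = cmpxchg i32* %addr, i32 %loaded, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %old, %loaded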
Alternatives
------------
1. Enhancing MachineCSE and friends so they can optimise this.
- Works for "%old = atomicrmw OP %addr, %inc; OP %old, %inc" already.
- Cmpxchg is more difficult. It's the control-flow optimisations that are
particularly powerful in these cases.
- It's more duplicated effort between different parts of LLVM.
2. Peephole optimisations in the AtomicExpandLoadLinked pass.
- Feels like a hack, spotting particular patterns that the mid-end can
already cope with.
- Deals with toy functions well, but it's not clear that those patterns are the
only ones that arise in more general code (after inlining, say).
- Potentially the safest option: atomicrmw & cmpxchg are preserved until
very late, so nothing can come in and convert valid code to invalid.
3. Moving expansion further forward but still using intrinsics.
- The intrinsics need to have fully general side effects to be correct:
effectively an "asm volatile" for optimisation purposes, which is very
heavy-handed for LLVM's other optimisations.
- Still need target hooks to create the calls, because intrinsics don't get
type lowered and so you can't write a fully generic one (e.g. an i64 ldrex
on ARM needs to return { i32, i32 }).
4. Change the cmpxchg operation to return (say) { iN, i1 }, where the second
value indicates success (sketched after this list).
- Probably good idea anyway as part of support for weak compare-exchange
operations.
- Doesn't actually help this issue much: it's mostly control-flow
optimisations that seem to be helpful, and this would leave us with phi
nodes in need of materialisation when lowered.
- Might make the peephole optimisations in 2 slightly simpler.
5. Some kind of generalisation that can cover both LL/SC and transactional
memory to avoid doing this again if and when that works out?
- I can't think of one that's both neat and practically implementable for
both.
- Suggestions welcome.
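Returning to alternative 4, the struct-returning cmpxchg might look something
like this (sketch only):

    %res = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst ; { i32, i1 }
    %loaded = extractvalue { i32, i1 } %res, 0
    %success = extractvalue { i32, i1 } %res, 1
    br i1 %success, label %true, label %false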