[LLVMdev] Plan to optimize atomics in LLVM

Fri Aug 8 12:22:51 PDT 2014

Here is an example of code on which ExpandLoadLinkedPass looks unsound to
me (It is a variant of the SB pattern in the literature):

> x, y two globals of type *i32 (atomic)

Thread 0:
>  store atomic i32 1, i32* @x seq_cst, align 4
>  %res = cmpxchg i32* @y, i32 0, i32 0, seq_cst, seq_cst
>  %old_y = extractvalue {i32, i1} %res, 0
> Thread 1:
>   store atomic i32 1, i32* @y seq_cst, align 4
>  %res = cmpxchg i32* @x, i32 0, i32 0, seq_cst, seq_cst
>  %old_x = extractvalue {i32, i1} %res, 0

The pass turns the cmpxchg into (approximately, there is a bunch of
control-flow in addition):

> fence release
> load-linked monotonic
> store-conditional monotonic
> fence seq_cst

>From my reading of Atomics.rst, it would be sound to reorder (It does not
say much about load-linked, so I am treating it as a normal load here)

> store seq_cst
> fence release
> load-linked monotonic

into

> load-linked monotonic
> store seq_cst
> fence release

Which would make an execution ending in %old_x = %old_y = 0 possible, while
it is impossible in the original program.
Since some atomic stores may be turned into AtomicCmpXchg if they are too
big, it may even occur in code with originally only atomic stores and loads.

Fixing it is a two line change in insertLeadingFence, but it triggers some
test failures, both because of tests looking specifically for a fence
release here, or looking for a dmb ishst which is asserted to be correct
for fence release on swift processors (I have no idea if it is correct in
this specific scenario, I could not find documentation on the exact quirks
of the memory model of swift, assuming it is public).

About your other comments:
I will look at target hooks, I had not thought about using them.

I agree that some modularity is required. But currently there is lots of
code duplication, for example RMWs are lowered into monotonic operations +
fences in ExpandLoadLinked (as we just saw), and load/stores are lowered
into monotonic operations + fences in SelectionDAGBuilder.. I hope to share
some code between those by pushing things into a common ExpandAtomicsPass.

Thanks,
Robin

On Fri, Aug 8, 2014 at 11:42 AM, Tim Northover <t.p.northover at gmail.com>
wrote:

> > I am planning in doing in IR, but with target specific-passes (such as
> X86ExpandAtomicPass)
> > that just share some of the code
>
> This would more normally be done via target hooks in LLVM, though the
> principle is sound.
>
> > But it must be target-dependent as for example on Power a
> > seq_cst store has a fence before it, while on ARM it has a fence
> > both before and after it (per
> http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
>
> That certainly seems to suggest some kind of parametrisation.
>
> > For this exact reason, I am planning on splitting
> AtomicExpandLoadLinkedPass
> > in a target-independent and a target-dependent file: the current pass
> claims
> > to be target-independent but is actually designed for ARM: for example it
> > puts a release-fence before a seq_cst CAS, which would be unsound on
> Power
> > if the backend was more agressive and using lwsync for release_fences.
>
> I don't know the Power architecture, but this failure ought to be
> describable in terms of LLVM's own memory model (if a valid Power
> implementation of LLVM IR can trigger it, then the transformation
> itself is wrong). Do you have an example execution in mind that shows
> it?
>
> > Since
> > these fences are not chosen based on the LLVM fences semantics, but on
> the
> > hardware memory model, I was thinking of inserting target-specific
> > intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it
> > clearer that these passes are target-specific and unsound outside of
> their
> > target.
>
> If they *are* unsound, that should be changed immediately (and I'll
> almost certainly make time to do so, hence my questions here). It's
> completely unacceptable to give LLVM's "fence whatever" a
> target-specific meaning in my opinion, even briefly.
>
> > Another thing I would have to move to this IR pass is the insertion of
> > fences around atomic stores/loads when insertFencesForAtomic==true. It is
> > currently happening in SelectionDAGBuilder, which makes it impossible to
> do
> > fence elimination at the IR level.
>
> I'm a little worried about this being just one monster "do stuff with
> atomics" pass, especially if it ends up one-per-target, but even if
> the bulk can remain generic. I'd prefer a more compositional approach
> if it can be managed.
>
> > Is it reasonable, or is there some rule against using hardware-specific
> > intrinsics at the hardware level (or some other problem with this
> approach)?
>
> Lots of the changes sound like they're going in the right direction.
> I'd particularly pleased to see other architectures using (via
> whatever adaptations are necessary) the atomic expansion pass; I think
> that could significantly simplify other backends.
>
> I'm a little concerned about changing the "fence XYZ" conversion into
> target intrinsics, but it looks likely it'd be necessary for
> performance even if the current scheme does turn out to be correct so
> I say go for it!
>
> Cheers.
>
> Tim.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140808/c4060604/attachment.html>