[LLVMdev] Plan to optimize atomics in LLVM

Fri Aug 8 12:23:42 PDT 2014

I forgot to say that x and y are to be initially 0 for the example to work.

On Fri, Aug 8, 2014 at 12:22 PM, Robin Morisset <morisset at google.com> wrote:

> Here is an example of code on which ExpandLoadLinkedPass looks unsound to
> me (It is a variant of the SB pattern in the literature):
>
>> x, y two globals of type *i32 (atomic)
>
> Thread 0:
>>  store atomic i32 1, i32* @x seq_cst, align 4
>>  %res = cmpxchg i32* @y, i32 0, i32 0, seq_cst, seq_cst
>>  %old_y = extractvalue {i32, i1} %res, 0
>> Thread 1:
>>   store atomic i32 1, i32* @y seq_cst, align 4
>>  %res = cmpxchg i32* @x, i32 0, i32 0, seq_cst, seq_cst
>>  %old_x = extractvalue {i32, i1} %res, 0
>
>
> The pass turns the cmpxchg into (approximately, there is a bunch of
> control-flow in addition):
>
>> fence release
>> load-linked monotonic
>> store-conditional monotonic
>> fence seq_cst
>
>
> From my reading of Atomics.rst, it would be sound to reorder (It does not
> say much about load-linked, so I am treating it as a normal load here)
>
>> store seq_cst
>> fence release
>> load-linked monotonic
>
> into
>
>> load-linked monotonic
>> store seq_cst
>> fence release
>
>
> Which would make an execution ending in %old_x = %old_y = 0 possible,
> while it is impossible in the original program.
> Since some atomic stores may be turned into AtomicCmpXchg if they are too
> big, it may even occur in code with originally only atomic stores and loads.
>
> Fixing it is a two line change in insertLeadingFence, but it triggers some
> test failures, both because of tests looking specifically for a fence
> release here, or looking for a dmb ishst which is asserted to be correct
> for fence release on swift processors (I have no idea if it is correct in
> this specific scenario, I could not find documentation on the exact quirks
> of the memory model of swift, assuming it is public).
>
>
> About your other comments:
> I will look at target hooks, I had not thought about using them.
>
> I agree that some modularity is required. But currently there is lots of
> code duplication, for example RMWs are lowered into monotonic operations +
> fences in ExpandLoadLinked (as we just saw), and load/stores are lowered
> into monotonic operations + fences in SelectionDAGBuilder.. I hope to share
> some code between those by pushing things into a common ExpandAtomicsPass.
>
> Thanks,
> Robin
>
>
> On Fri, Aug 8, 2014 at 11:42 AM, Tim Northover <t.p.northover at gmail.com>
> wrote:
>
>> > I am planning in doing in IR, but with target specific-passes (such as
>> X86ExpandAtomicPass)
>> > that just share some of the code
>>
>> This would more normally be done via target hooks in LLVM, though the
>> principle is sound.
>>
>> > But it must be target-dependent as for example on Power a
>> > seq_cst store has a fence before it, while on ARM it has a fence
>> > both before and after it (per
>> http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
>>
>> That certainly seems to suggest some kind of parametrisation.
>>
>> > For this exact reason, I am planning on splitting
>> AtomicExpandLoadLinkedPass
>> > in a target-independent and a target-dependent file: the current pass
>> claims
>> > to be target-independent but is actually designed for ARM: for example
>> it
>> > puts a release-fence before a seq_cst CAS, which would be unsound on
>> Power
>> > if the backend was more agressive and using lwsync for release_fences.
>>
>> I don't know the Power architecture, but this failure ought to be
>> describable in terms of LLVM's own memory model (if a valid Power
>> implementation of LLVM IR can trigger it, then the transformation
>> itself is wrong). Do you have an example execution in mind that shows
>> it?
>>
>> > Since
>> > these fences are not chosen based on the LLVM fences semantics, but on
>> the
>> > hardware memory model, I was thinking of inserting target-specific
>> > intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it
>> > clearer that these passes are target-specific and unsound outside of
>> their
>> > target.
>>
>> If they *are* unsound, that should be changed immediately (and I'll
>> almost certainly make time to do so, hence my questions here). It's
>> completely unacceptable to give LLVM's "fence whatever" a
>> target-specific meaning in my opinion, even briefly.
>>
>> > Another thing I would have to move to this IR pass is the insertion of
>> > fences around atomic stores/loads when insertFencesForAtomic==true. It
>> is
>> > currently happening in SelectionDAGBuilder, which makes it impossible
>> to do
>> > fence elimination at the IR level.
>>
>> I'm a little worried about this being just one monster "do stuff with
>> atomics" pass, especially if it ends up one-per-target, but even if
>> the bulk can remain generic. I'd prefer a more compositional approach
>> if it can be managed.
>>
>> > Is it reasonable, or is there some rule against using hardware-specific
>> > intrinsics at the hardware level (or some other problem with this
>> approach)?
>>
>> Lots of the changes sound like they're going in the right direction.
>> I'd particularly pleased to see other architectures using (via
>> whatever adaptations are necessary) the atomic expansion pass; I think
>> that could significantly simplify other backends.
>>
>> I'm a little concerned about changing the "fence XYZ" conversion into
>> target intrinsics, but it looks likely it'd be necessary for
>> performance even if the current scheme does turn out to be correct so
>> I say go for it!
>>
>> Cheers.
>>
>> Tim.
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140808/e09758a1/attachment.html>