<div dir="ltr">I forgot to say that x and y are to be initially 0 for the example to work.</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Aug 8, 2014 at 12:22 PM, Robin Morisset <span dir="ltr"><<a href="mailto:morisset@google.com" target="_blank">morisset@google.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Here is an example of code on which ExpandLoadLinkedPass looks unsound to me (It is a variant of the SB pattern in the literature):<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

x, y two globals of type *i32 (atomic)</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Thread 0:<br>


 store atomic i32 1, i32* @x seq_cst, align 4<br> %res = cmpxchg i32* @y, i32 0, i32 0, seq_cst, seq_cst<br> %old_y = extractvalue {i32, i1} %res, 0<br>Thread 1:<br>  store atomic i32 1, i32* @y seq_cst, align 4<br> %res = cmpxchg i32* @x, i32 0, i32 0, seq_cst, seq_cst<br>


 %old_x = extractvalue {i32, i1} %res, 0</blockquote><div> </div><div>The pass turns the cmpxchg into (approximately, there is a bunch of control-flow in addition):</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


fence release<br>load-linked monotonic<br>store-conditional monotonic<br>fence seq_cst</blockquote><div> </div><div>From my reading of Atomics.rst, it would be sound to reorder (It does not say much about load-linked, so I am treating it as a normal load here)</div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">store seq_cst<br>fence release<br>load-linked monotonic</blockquote>


<div>into</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">load-linked monotonic<br>store seq_cst<br>

fence release</blockquote>

<div><br></div><div>Which would make an execution ending in %old_x = %old_y = 0 possible, while it is impossible in the original program.</div><div>Since some atomic stores may be turned into AtomicCmpXchg if they are too big, it may even occur in code with originally only atomic stores and loads.</div>


<div><br></div><div>Fixing it is a two line change in insertLeadingFence, but it triggers some test failures, both because of tests looking specifically for a fence release here, or looking for a dmb ishst which is asserted to be correct for fence release on swift processors (I have no idea if it is correct in this specific scenario, I could not find documentation on the exact quirks of the memory model of swift, assuming it is public). </div>


<div><br></div><div><br></div><div>About your other comments:</div><div>I will look at target hooks, I had not thought about using them.</div><div><br></div><div>I agree that some modularity is required. But currently there is lots of code duplication, for example RMWs are lowered into monotonic operations + fences in ExpandLoadLinked (as we just saw), and load/stores are lowered into monotonic operations + fences in SelectionDAGBuilder.. I hope to share some code between those by pushing things into a common ExpandAtomicsPass.</div>


<div><br></div><div>Thanks,</div><div>Robin</div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Aug 8, 2014 at 11:42 AM, Tim Northover <span dir="ltr"><<a href="mailto:t.p.northover@gmail.com" target="_blank">t.p.northover@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>> I am planning in doing in IR, but with target specific-passes (such as X86ExpandAtomicPass)<br>

</div><div>> that just share some of the code<br>

<br>

</div>This would more normally be done via target hooks in LLVM, though the<br>

principle is sound.<br>

<div><br>

> But it must be target-dependent as for example on Power a<br>

> seq_cst store has a fence before it, while on ARM it has a fence<br>

> both before and after it (per <a href="http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html" target="_blank">http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html</a>)<br>

<br>

</div>That certainly seems to suggest some kind of parametrisation.<br>

<div><br>

> For this exact reason, I am planning on splitting AtomicExpandLoadLinkedPass<br>

> in a target-independent and a target-dependent file: the current pass claims<br>

> to be target-independent but is actually designed for ARM: for example it<br>

> puts a release-fence before a seq_cst CAS, which would be unsound on Power<br>

> if the backend was more agressive and using lwsync for release_fences.<br>

<br>

</div>I don't know the Power architecture, but this failure ought to be<br>

describable in terms of LLVM's own memory model (if a valid Power<br>

implementation of LLVM IR can trigger it, then the transformation<br>

itself is wrong). Do you have an example execution in mind that shows<br>

it?<br>

<div><br>

> Since<br>

> these fences are not chosen based on the LLVM fences semantics, but on the<br>

> hardware memory model, I was thinking of inserting target-specific<br>

> intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it<br>

> clearer that these passes are target-specific and unsound outside of their<br>

> target.<br>

<br>

</div>If they *are* unsound, that should be changed immediately (and I'll<br>

almost certainly make time to do so, hence my questions here). It's<br>

completely unacceptable to give LLVM's "fence whatever" a<br>

target-specific meaning in my opinion, even briefly.<br>

<div><br>

> Another thing I would have to move to this IR pass is the insertion of<br>

> fences around atomic stores/loads when insertFencesForAtomic==true. It is<br>

> currently happening in SelectionDAGBuilder, which makes it impossible to do<br>

> fence elimination at the IR level.<br>

<br>

</div>I'm a little worried about this being just one monster "do stuff with<br>

atomics" pass, especially if it ends up one-per-target, but even if<br>

the bulk can remain generic. I'd prefer a more compositional approach<br>

if it can be managed.<br>

<div><br>

> Is it reasonable, or is there some rule against using hardware-specific<br>

> intrinsics at the hardware level (or some other problem with this approach)?<br>

<br>

</div>Lots of the changes sound like they're going in the right direction.<br>

I'd particularly pleased to see other architectures using (via<br>

whatever adaptations are necessary) the atomic expansion pass; I think<br>

that could significantly simplify other backends.<br>

<br>

I'm a little concerned about changing the "fence XYZ" conversion into<br>

target intrinsics, but it looks likely it'd be necessary for<br>

performance even if the current scheme does turn out to be correct so<br>

I say go for it!<br>

<br>

Cheers.<br>

<span><font color="#888888"><br>

Tim.<br>

</font></span></blockquote></div><br></div>

</div></div></blockquote></div><br></div>