[LLVMdev] Plan to optimize atomics in LLVM

Fri Aug 8 13:49:58 PDT 2014

On 08/08/2014 11:42 AM, Tim Northover wrote:
>> I am planning in doing in IR, but with target specific-passes (such as X86ExpandAtomicPass)
>> that just share some of the code
> This would more normally be done via target hooks in LLVM, though the
> principle is sound.
>
>> But it must be target-dependent as for example on Power a
>> seq_cst store has a fence before it, while on ARM it has a fence
>> both before and after it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
> That certainly seems to suggest some kind of parametrisation.
An alternate way of saying this might be that both ARM and Power require 
the store to be fenced before and after.  On Power the fence after is 
implicit, where on ARM it is not.  (Is this actually correct?  I don't 
know either of these models well.)

Could you use that framing to factor the arch specific and general 
parts?  I'd really love to have a generic barrier combine pass which can 
work on the IR semantics independent of the architecture barrier semantics.
>
>> For this exact reason, I am planning on splitting AtomicExpandLoadLinkedPass
>> in a target-independent and a target-dependent file: the current pass claims
>> to be target-independent but is actually designed for ARM: for example it
>> puts a release-fence before a seq_cst CAS, which would be unsound on Power
>> if the backend was more agressive and using lwsync for release_fences.
> I don't know the Power architecture, but this failure ought to be
> describable in terms of LLVM's own memory model (if a valid Power
> implementation of LLVM IR can trigger it, then the transformation
> itself is wrong). Do you have an example execution in mind that shows
> it?
>
>> Since
>> these fences are not chosen based on the LLVM fences semantics, but on the
>> hardware memory model, I was thinking of inserting target-specific
>> intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it
>> clearer that these passes are target-specific and unsound outside of their
>> target.
> If they *are* unsound, that should be changed immediately (and I'll
> almost certainly make time to do so, hence my questions here). It's
> completely unacceptable to give LLVM's "fence whatever" a
> target-specific meaning in my opinion, even briefly.
I strongly agree with Tim's point here.  A latent bug is still a bug.
>
>> Another thing I would have to move to this IR pass is the insertion of
>> fences around atomic stores/loads when insertFencesForAtomic==true. It is
>> currently happening in SelectionDAGBuilder, which makes it impossible to do
>> fence elimination at the IR level.
> I'm a little worried about this being just one monster "do stuff with
> atomics" pass, especially if it ends up one-per-target, but even if
> the bulk can remain generic. I'd prefer a more compositional approach
> if it can be managed.
>
>> Is it reasonable, or is there some rule against using hardware-specific
>> intrinsics at the hardware level (or some other problem with this approach)?
> Lots of the changes sound like they're going in the right direction.
> I'd particularly pleased to see other architectures using (via
> whatever adaptations are necessary) the atomic expansion pass; I think
> that could significantly simplify other backends.
>
> I'm a little concerned about changing the "fence XYZ" conversion into
> target intrinsics, but it looks likely it'd be necessary for
> performance even if the current scheme does turn out to be correct so
> I say go for it!
I would say there's a burden of justification that the target intrinsic 
approach is substantially better performance wise.  This doesn't have to 
be extensive, but something should be presented. (If the generic 
approach is actually possible.)

Philip