[LLVMdev] Plan to optimize atomics in LLVM

Thu Aug 14 12:10:51 PDT 2014

ping.

I am currently working on such a truly target-independent atomics expansion
pass, and hope to have some patches for review by the end of the week. But
I would like to have your opinion on my example of the unsoundness of using
a leading release fence for seq_cst CAS, in order to know whether or not I
should fix it in the process.

Thanks,
Robin

On Fri, Aug 8, 2014 at 2:10 PM, Robin Morisset <morisset at google.com> wrote:

> On Fri, Aug 8, 2014 at 1:49 PM, Philip Reames <listmail at philipreames.com>
> wrote:
>
>>
>> On 08/08/2014 11:42 AM, Tim Northover wrote:
>>
>>>> I am planning on doing it in IR, but with target-specific passes (such as
>>>> X86ExpandAtomicPass)
>>>> that just share some of the code
>>>>
>>> This would more normally be done via target hooks in LLVM, though the
>>> principle is sound.
>>>
>>>> But it must be target-dependent, as for example on Power a
>>>> seq_cst store has a fence before it, while on ARM it has a fence
>>>> both before and after it (per
>>>> http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
>>>>
>>> That certainly seems to suggest some kind of parametrisation.
>>>
>> An alternate way of saying this might be that both ARM and Power require
>> the store to be fenced before and after.  On Power the fence after is
>> implicit, where on ARM it is not.  (Is this actually correct?  I don't know
>> either of these models well.)
>>
>> Could you use that framing to factor the arch specific and general parts?
>>  I'd really love to have a generic barrier combine pass which can work on
>> the IR semantics independent of the architecture barrier semantics.
>
>
> More precisely, both ARM and Power require a barrier between every store
> seq_cst and every later load seq_cst (among lots of other requirements).
> On Power the mapping achieves this by a full fence before every load
> seq_cst, whereas ARM uses a full fence after every store seq_cst.
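The store-then-load requirement above is the classic store-buffering litmus test. The following is a minimal C++11 sketch of it, not code from any patch in this thread; the function names run_store_buffering and check_store_buffering are made up for illustration. With seq_cst on all four accesses, the outcome r1 == 0 && r2 == 0 is forbidden by the language, and whichever fence placement a target chooses (before the load, as described for Power, or after the store, as described for ARM) has to rule it out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Store-buffering litmus test (hypothetical helper, for illustration only).
// Each thread does a seq_cst store to one location, then a seq_cst load
// from the other. Returns the two loaded values {r1, r2}.
std::pair<int, int> run_store_buffering() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        // A barrier is needed somewhere between the store above and the
        // load below; where it goes is the target-specific part.
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });

    t1.join();
    t2.join();
    return {r1, r2};
}

// Repeats the test and asserts that the forbidden outcome never shows up.
// A passing run proves nothing about a mapping; a failing run would show
// that the mapping (or the hardware) is broken.
void check_store_buffering(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::pair<int, int> r = run_store_buffering();
        assert(!(r.first == 0 && r.second == 0));
    }
}
```

Weakening any of the four accesses to release/acquire would make the r1 == 0 && r2 == 0 outcome legal, which is why this requirement is specific to seq_cst.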
>
> I would also love to have a generic barrier combine pass, but I strongly
> doubt it is at all possible.
>
>
>>
>>>> Is it reasonable, or is there some rule against using hardware-specific
>>>> intrinsics at the hardware level (or some other problem with this
>>>> approach)?
>>>>
>>> Lots of the changes sound like they're going in the right direction.
>>> I'd be particularly pleased to see other architectures using (via
>>> whatever adaptations are necessary) the atomic expansion pass; I think
>>> that could significantly simplify other backends.
>>>
>>> I'm a little concerned about changing the "fence XYZ" conversion into
>>> target intrinsics, but it looks likely it'd be necessary for
>>> performance even if the current scheme does turn out to be correct so
>>> I say go for it!
>>>
>> I would say there's a burden of justification that the target intrinsic
>> approach is substantially better performance wise.  This doesn't have to be
>> extensive, but something should be presented. (If the generic approach is
>> actually possible.)
>
>
> For one simple example: acquire loads on ARM that are followed by a
> dependent branch can be implemented by putting an isb fence at each target
> of the branch (I can look up the reference for this if you want), which is
> supposedly cheaper (I am hoping to get some benchmarks on this and similar
> things soon). But all the C11 fences, including the acquire fence, require a
> full dmb barrier. So it is impossible to express this optimized mapping of
> acquire loads in a target-independent way.
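To illustrate the difference at the source level, here is a minimal C++11 sketch of the two shapes being compared. The Mailbox type and all function names are hypothetical. Both readers are correct; the point made above is that only the first leaves the backend free to pick the branch-plus-isb mapping on ARM, because once the acquire is expressed as a standalone fence it has to be lowered to a full dmb.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical message-passing cell: payload is published through `ready`.
struct Mailbox {
    int payload = 0;
    std::atomic<bool> ready{false};
};

// Writer side: plain store of the data, then a release store of the flag.
inline void publish(Mailbox& m, int value) {
    m.payload = value;
    m.ready.store(true, std::memory_order_release);
}

// Shape A: acquire load immediately followed by a branch on the loaded
// value. This is the pattern the thread says can use an isb at the branch
// targets on ARM instead of a dmb.
inline int read_with_acquire_load(Mailbox& m) {
    if (m.ready.load(std::memory_order_acquire)) {
        return m.payload;
    }
    return -1;  // not published yet
}

// Shape B: relaxed load plus an explicit acquire fence. Same guarantee for
// this reader, but the fence is a separate operation, so the cheaper
// mapping of shape A cannot be expressed this way.
inline int read_with_acquire_fence(Mailbox& m) {
    bool is_ready = m.ready.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return is_ready ? m.payload : -1;
}

// Single-threaded smoke check: it exercises the API only and says nothing
// about ordering between threads.
inline void smoke_check() {
    Mailbox m;
    assert(read_with_acquire_load(m) == -1);
    publish(m, 42);
    assert(read_with_acquire_load(m) == 42);
    assert(read_with_acquire_fence(m) == 42);
}
```

An expansion that first rewrites shape A into shape B in a target-independent way would therefore lose the opportunity, which is the argument for doing the lowering with target-specific knowledge.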
>
>