[PATCH] D67485: AArch64: use ldp/stp for atomic & volatile 128-bit where appropriate.

Tim Northover via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 20 07:29:53 PDT 2019


t.p.northover added a comment.

> First, it appears that the current codegen (a CAS loop) for 128-bit atomic accesses is broken, based on this comment: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814#c3. There are two problematic cases as far as I understand: (1) const and (2) volatile atomic objects. Const objects disallow write access to the underlying memory, while volatile objects mandate that each byte of the underlying memory be accessed exactly once according to the AAPCS. The CAS loop violates both.
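
For reference, the two broken cases look roughly like this (a sketch with made-up names, not code taken from the bug report; 128-bit `std::atomic` may need `-latomic` or LSE2 depending on the target):

```cpp
#include <atomic>

// (1) A const atomic object may be placed in read-only memory; a CAS-loop
// expansion turns this plain load into a compare-and-swap that writes the
// memory, so it can fault.
const std::atomic<__int128> ro_value{0};

__int128 load_const() {
  return ro_value.load(std::memory_order_relaxed);
}

// (2) A CAS loop can also retry, so the bytes of a volatile object are no
// longer accessed exactly once as the AAPCS requires.
__int128 load_volatile(volatile std::atomic<__int128> &dev) {
  return dev.load(std::memory_order_relaxed);
}
```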



> Maybe the solution is to follow GCC here, at least for the general case (architectures prior to v8.4), meaning to expand atomic operations into a call to the appropriate __atomic library function. I believe this is what AtomicExpandPass does in LLVM. In that case we would need to provide an implementation of those functions.

I think Clang is involved there too, in horribly non-obvious ways (for example, I think going through Clang is the only way to get the actual `__atomic_*` libcalls you want rather than the legacy `__sync_*` ones). Either way, that's a change that would need pretty careful coordination. Since all of our CPUs are Cyclone or above, we could probably skip the libcalls entirely at Apple without ABI breakage (which, unintentionally, is what this patch does).
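
For completeness, the size-16 entry points an out-of-line implementation would have to provide look roughly like this (prototypes are my sketch of the __atomic libcall ABI, not copied from any header):

```cpp
// 16-byte __atomic libcalls; the trailing int is the C11 memory-order value.
// Exact parameter types are an assumption for illustration.
extern "C" __int128 __atomic_load_16(const volatile void *src, int order);
extern "C" void __atomic_store_16(volatile void *dst, __int128 value, int order);
```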

> My concern about using ldp/stp is that the specification promises single-copy atomicity provided that accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory. Is there such a guarantee to the compiler? Maybe that's the default for Operating Systems but not for bare-metal? Would it make sense to enable this optimization for certain target triples, i.e. when sys != none?

I don't think anyone has written down such a guarantee, but we've pretty much always assumed we're accessing reasonably normal memory: the `dmb` instructions we emit are always `ish`, for example. I've never had any comments from our more embedded developers on that front (or seen anyone try to do general atomics in another realm); I suspect they go to assembly for the few spots where it might matter.
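
Roughly the shape I have in mind for the ldp side, for what it's worth (the exact barrier placement is my assumption, not copied from the patch):

```cpp
#include <atomic>

// A seq_cst 128-bit load that would lower to something like:
//   ldp  x0, x1, [x8]
//   dmb  ish
// i.e. the paired load plus an inner-shareable barrier; the ldp is only
// single-copy atomic on Normal write-back cacheable memory.
__int128 load128(std::atomic<__int128> &x) {
  return x.load(std::memory_order_seq_cst);
}
```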

> In certain code sequences where you have two consecutive atomic stores, or an atomic load followed by an atomic store you'll end up with redundant memory barriers. Is there a way to get rid of them?

ARM has a pass designed to merge adjacent barriers, though I've seen it miss some cases. We might think about porting it to AArch64, or maybe doing some of that work generically in AtomicExpandPass.
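
Concretely, the case I think you mean looks like this (a sketch; the exact barrier sequence depends on the lowering and is an assumption here):

```cpp
#include <atomic>

// Two back-to-back seq_cst 128-bit stores. With a barrier-based stp lowering,
// each store gets its own surrounding dmb ish, so the adjacent barriers
// between the two stp instructions are redundant and nothing on AArch64
// currently merges them.
void two_stores(std::atomic<__int128> &a, std::atomic<__int128> &b,
                __int128 v) {
  a.store(v, std::memory_order_seq_cst);
  b.store(v, std::memory_order_seq_cst);
}
```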

> Enabling the corresponding subtarget feature on Cyclone doesn't seem safe to me. If we ever implement -mtune, then a command line like clang -march=armv8a -mtune=cyclone should mean "generate correct code for the v8.0 architecture, but optimize for Cyclone". Adding this feature to Cyclone as is would probably result in the above command line producing code that isn't architecturally correct for v8.0.

I don't think Clang really does anything with -mtune yet. The most it could do, given the way CPUs are implemented in LLVM now, would be something like applying the scheduling model. Almost all of the features in a CPU's feature list are going to break older CPUs anyway.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D67485/new/

https://reviews.llvm.org/D67485




