[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited

Thu Jun 14 04:32:11 PDT 2018

On 14 June 2018 at 10:28, Tim Northover via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>   * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>>   expansion in IR and to have that code deleted from AtomicExpandPass. Are
>>   there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.

Thanks for the reply, Tim. It's definitely fair to highlight that ARM
and AArch64 have fewer documented restrictions on code within an LL/SC
loop than e.g. RISC-V or MIPS. I'm not sure about Hexagon.

> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.

Let's separate the concerns here:
1) Quality of codegen within the LL/SC loop
  * What is the specific concern here? The LL/SC loop contains a very
small number of instructions, even for the masked atomicrmw case. Are
you worried about an extra arithmetic instruction or two? Sub-optimal
control-flow? Something else?
2) Number of new pseudo-instructions which must be introduced
  * You'd need new pseudos for each word-sized atomicrmw which expands
to an ll/sc loop, and an additional one for the masked form of the
operation. You could reduce the number of pseudos by taking the
AtomicRMWInst::BinOp as a parameter. The code to map the atomic op to
the appropriate instruction is tedious but straightforward.
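
To give a feel for how small that mapping is, here is a rough sketch. It
is not real LLVM code: the BinOp enum below is a hypothetical stand-in
for a subset of AtomicRMWInst::BinOp, aluOpsFor is a made-up helper, and
the mnemonics are generic placeholders rather than any target's actual
opcodes. A single pseudo carrying the op as an immediate operand could
do a lookup like this at expansion time.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a subset of AtomicRMWInst::BinOp, so the
// sketch compiles without any LLVM headers.
enum class BinOp { Xchg, Add, Sub, And, Nand, Or, Xor };

// Returns the placeholder ALU mnemonics that would sit between the LL and
// the SC for the given operation. The names are illustrative only.
std::vector<std::string> aluOpsFor(BinOp op) {
  switch (op) {
  case BinOp::Xchg:
    return {}; // The new value is stored directly; no arithmetic needed.
  case BinOp::Add:
    return {"add"};
  case BinOp::Sub:
    return {"sub"};
  case BinOp::And:
    return {"and"};
  case BinOp::Nand:
    return {"and", "not"}; // Assumes the target has no single nand.
  case BinOp::Or:
    return {"or"};
  case BinOp::Xor:
    return {"xor"};
  }
  return {}; // Unreachable for valid enum values.
}
```

Even the worst case here adds two instructions to the loop body, which
is what I mean by "a very small number of instructions".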

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.

That's true as long as the "Exclusives Reservation Granule" == 1
cache-line and you don't deterministically cause the reservation to be
cleared in some other way, e.g. by repeatably generating a conflict miss
or triggering a trap. I don't think any Arm cores ship with
direct-mapped caches so I'll admit this is unlikely.

The possibility for issues increases if the Exclusives Reservation
Granule is larger. For the Cortex-M4, the ERG is the entire address
range <http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100166_0001_00_en/ric1417175928887.html>.
In that case, spills will surely clear the reservation.
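
To make the granule arithmetic concrete, here is a minimal sketch. It
assumes the ERG is a power of two and is passed as its log2 in bytes;
sameReservationGranule is a made-up helper, not anything in LLVM. Two
addresses share a granule when they agree in all bits above the granule
size, so a 64-byte granule (log2 = 6) only catches spills near the
atomic object, while a granule covering a full 32-bit address space
(log2 = 32) catches every spill.

```cpp
#include <cstdint>

// Hypothetical helper: true if addresses a and b fall within the same
// Exclusives Reservation Granule of size 2^ergLog2 bytes.
bool sameReservationGranule(std::uint64_t a, std::uint64_t b,
                            unsigned ergLog2) {
  // Shifting a 64-bit value by 64 or more is undefined behaviour in C++,
  // and such a granule covers every representable address anyway.
  if (ergLog2 >= 64)
    return true;
  return (a >> ergLog2) == (b >> ergLog2);
}
```

For example, with ergLog2 = 6 a spill slot at 0x1040 does not alias an
atomic object at 0x1000, but with ergLog2 = 32 any two 32-bit addresses
do.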

There also seem to be documented and very strict forward progress
constraints for ARMv8-M. See
https://static.docs.arm.com/ddi0553/ah/DDI0553A_h_armv8m_arm.pdf p207

"""
Forward progress can only be made using LoadExcl/StoreExcl loops if,
for any LoadExcl/StoreExcl loop within a single thread of execution if
both of the following are true:
• There are no explicit memory accesses, pre-loads, direct or indirect
register writes, cache maintenance instructions, SVC instructions, or
exception returns between the Load-Exclusive and the Store-Exclusive.
• The following conditions apply between the Store-Exclusive having
returned a fail result and the retry of the Load-Exclusive:
  – There are no stores to any location within the same Exclusives
reservation granule that the StoreExclusive is accessing.
  – There are no direct or indirect register writes, other than
changes to the flag fields in APSR or FPSCR, caused by data processing
or comparison instructions.
  – There are no direct or indirect cache maintenance instructions,
SVC instructions, or exception returns
"""

Of course it also states that the upper limit for the Exclusives
Reservation Granule is 2048 bytes, but the Cortex-M33 has an ERG of
the entire address range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100230_0002_00_en/jfa1443092906126.html>
so something doesn't quite add up...

> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I suppose a machine-level pass could at least scan for any intervening
loads/stores in an LL/SC loop and check some other invariants, then
warn/fail if they occur. As you point out, this would be conservative
for many Arm implementations.
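
Something along these lines is what I have in mind. This is only a toy
model, not a MachineFunctionPass: the Kind enum and
findInterveningAccesses are hypothetical, and it assumes the
instructions are given as a single straight-line sequence rather than a
real CFG. It reports the index of every ordinary load or store that
appears after a load-exclusive and before the next store-exclusive.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified instruction classification.
enum class Kind { LoadExclusive, StoreExclusive, Load, Store, Alu, Branch };

// Returns the positions of plain loads/stores that sit between a
// load-exclusive and the following store-exclusive.
std::vector<std::size_t>
findInterveningAccesses(const std::vector<Kind> &insts) {
  std::vector<std::size_t> violations;
  bool inExclusiveRegion = false;
  for (std::size_t i = 0; i < insts.size(); ++i) {
    switch (insts[i]) {
    case Kind::LoadExclusive:
      inExclusiveRegion = true; // Reservation is now held.
      break;
    case Kind::StoreExclusive:
      inExclusiveRegion = false; // Reservation is consumed or lost.
      break;
    case Kind::Load:
    case Kind::Store:
      if (inExclusiveRegion)
        violations.push_back(i); // e.g. a spill or reload in the loop.
      break;
    default:
      break; // ALU ops and branches are not flagged by this check.
    }
  }
  return violations;
}
```

A real pass would need to walk the loop's basic blocks and could also
check the stricter ARMv8-M conditions quoted above; this sketch only
covers the intervening load/store case.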

Best,

Alex