[llvm-bugs] [Bug 48017] New: [AArch64] Under -O0, atomicrmw contains an extra store in the ldaxr/stlxr loop

Fri Oct 30 02:06:31 PDT 2020

https://bugs.llvm.org/show_bug.cgi?id=48017

            Bug ID: 48017
           Summary: [AArch64] Under -O0, atomicrmw contains an extra store
                    in the ldaxr/stlxr loop
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: AArch64
          Assignee: unassignedbugs at nondot.org
          Reporter: rofirrim at gmail.com
                CC: arnaud.degrandmaison at arm.com,
                    llvm-bugs at lists.llvm.org, smithp352 at googlemail.com,
                    Ties.Stuij at arm.com

Created attachment 24112
  --> https://bugs.llvm.org/attachment.cgi?id=24112&action=edit
LLVM IR snippet at -O0 (slightly simplified)

The following C++ snippet compiled under -O0

  #include <atomic>
  std::atomic<int> _value(0);
  void foo() { _value += 1; }

generates the attached IR (slightly simplified). Under -O0, llc lowers that IR
to the usual ldaxr/stlxr loop.

$ llc -O0 -mtriple aarch64 -o - myatomic.ll
...
.LBB0_1:                                // %atomicrmw.start
                                        // =>This Inner Loop Header: Depth=1
        ldr     x10, [sp, #16]                  // 8-byte Folded Reload
        ldr     w9, [sp, #24]                   // 4-byte Folded Reload
        ldaxr   w8, [x10]
                                        // kill: def $x8 killed $w8
                                        // kill: def $w8 killed $w8 killed $x8
        str     w8, [sp, #12]                   // 4-byte Folded Spill (!!!)
        add     w9, w8, w9
        stlxr   w8, w9, [x10]
        cbnz    w8, .LBB0_1
...

When this code runs on a ThunderX machine, the loop hangs.
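
A minimal sketch of a standalone driver for the snippet above, in case it helps
others try to reproduce the hang. The names `counter`, `increment` and
`run_increments` are hypothetical and not part of the attached IR; the sketch
assumes it is built at -O0 for an AArch64 target with the affected compiler.

```cpp
#include <atomic>

// Stands in for `_value` from the report (hypothetical name).
std::atomic<int> counter(0);

// Same operation as foo() in the report: a sequentially consistent
// atomic read-modify-write on a global.
void increment() { counter += 1; }

// Hypothetical driver: performs `iters` increments and returns the
// resulting counter value. If the exclusive loop never makes progress,
// this call simply never returns.
int run_increments(int iters) {
    for (int i = 0; i < iters; ++i)
        increment();
    return counter.load();
}
```

Calling `run_increments(1000)` from `main` and printing the result should be
enough to tell a hanging build from a working one: on an unaffected setup the
call returns promptly.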

That extra `str` instruction (which looks like a side effect of the register
allocator) seems to cause the exclusive access to be lost, so the code loops
forever. This might be fallout from the recent rewrite of RegAllocFast.

Now, this is odd because:
 - That store accesses the stack while x10 holds a global address, so the two
locations are far enough apart that the str shouldn't cause the exclusive
access to be lost.
 - The problem doesn't happen on all AArch64 implementations: Raspberry Pi 4
and A64FX are unaffected. We have only been able to reproduce this reliably on
a ThunderX machine.

So to be honest I'm not sure whether:
 - This is a bug in that ThunderX.
 - This is a bug in LLVM.

For the latter case, the Armv8-A spec (Issue E.a of the document) says in
§B2.9.5 that:

"LoadExcl / StoreExcl loops are guaranteed to make forward progress only if,
for any LoadExcl / StoreExcl loop within a single thread of execution, the
software meets all of the following conditions:

1. Between the Load-Exclusive and the Store-Exclusive, there are no explicit
memory accesses, preloads, direct or indirect System register writes, address
translation instructions, cache or TLB maintenance instructions, exception
generating instructions, exception returns, or indirect branches"

This could suggest that the store had better not be inside the loop if we want
to guarantee forward progress on all AArch64 implementations. However, I'm no
expert in this area, and perhaps the loop is OK and we're observing a problem
in our particular AArch64 implementation.
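
To make condition 1 concrete, here is a rough sketch of a checker that scans a
list of assembly lines and reports plain loads/stores that sit between a
load-exclusive and the following store-exclusive. Everything here is
hypothetical tooling, not something LLVM provides: the function names are made
up, and the mnemonic lists are a small heuristic subset covering only explicit
memory accesses, not the other items in the quoted condition.

```cpp
#include <sstream>
#include <string>
#include <vector>

// First whitespace-delimited token of an assembly line ("" if blank).
std::string mnemonic(const std::string &line) {
    std::istringstream in(line);
    std::string tok;
    in >> tok;
    return tok;
}

// Heuristic, deliberately incomplete list of ordinary load/store mnemonics.
bool is_plain_memory_access(const std::string &m) {
    static const char *const kOps[] = {"ldr", "ldrb", "ldrh", "ldur", "ldp",
                                       "str", "strb", "strh", "stur", "stp"};
    for (const char *op : kOps)
        if (m == op)
            return true;
    return false;
}

// Returns every line with a plain memory access that appears after a
// load-exclusive and before the next store-exclusive.
std::vector<std::string>
accesses_in_exclusive_region(const std::vector<std::string> &lines) {
    std::vector<std::string> hits;
    bool inside = false;
    for (const std::string &line : lines) {
        const std::string m = mnemonic(line);
        if (m == "ldaxr" || m == "ldxr") { inside = true; continue; }
        if (m == "stlxr" || m == "stxr") { inside = false; continue; }
        if (inside && is_plain_memory_access(m))
            hits.push_back(line);
    }
    return hits;
}
```

Fed the instructions of the loop shown earlier, this flags only the
`str w8, [sp, #12]` spill; the two reloads before the `ldaxr` are outside the
exclusive region and are not reported.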

clang/llvm 11.0 is unaffected.
