[llvm-bugs] [Bug 38173] New: AArch64 compare_exchange_weak leads to a conditional branch on a constant (and very poor branch layout)

Sun Jul 15 20:15:56 PDT 2018

https://bugs.llvm.org/show_bug.cgi?id=38173

            Bug ID: 38173
           Summary: AArch64 compare_exchange_weak leads to a conditional
                    branch on a constant (and very poor branch layout)
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: performance
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

This code compile to some really dumb asm for AArch64.  (And moderately dumb
for x86-64).

#include <atomic>
#include <cstdint>

uint64_t CAS_retry(std::atomic<uint64_t>& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(
        expected, expected + 2,
        std::memory_order_seq_cst, std::memory_order_relaxed))
        {}

    return expected;
}

On Godbolt, with clang6.0 or trunk(337113)
 -std=gnu++11 -O3 -Wall -target aarch64-linux-android -stdlib=libc++ we get

   CAS_retry(std::__1::atomic<unsigned long long>&):
        ldr     x8, [x0]              # it doesn't optimize into a LL/SC retry
loop, but that's expected
          # first iteration of the CAS retry loop is peeled
        ldaxr   x9, [x0]
        cmp     x9, x8
        b.ne    .LBB0_3
        add     x10, x8, #2             // =2
        stlxr   w11, x10, [x0]
        cbnz    w11, .LBB0_4

        orr     w10, wzr, #0x1        ###### w10 = 0|1 = 1
        tbz     w10, #0, .LBB0_5      ###### then branch on the low bit: always
not-taken
        b       .LBB0_10              ###### jump to epilogue

.LBB0_3:
        clrex
.LBB0_4:
        mov     w10, wzr            ###### w10 = 0
        tbnz    w10, #0, .LBB0_10   ###### if(w10[0]!=0) always falls through

.LBB0_5:             # The main CAS retry loop is duplicated.
        mov     x8, x9
        ldaxr   x9, [x0]
        cmp     x9, x8
        b.ne    .LBB0_8
        add     x10, x8, #2             // =2
        stlxr   w11, x10, [x0]
        cbnz    w11, .LBB0_9
        orr     w10, wzr, #0x1
        cbz     w10, .LBB0_5           #### Again branching on a constant
        b       .LBB0_10
.LBB0_8:                                //   in Loop: Header=BB0_5 Depth=1
        clrex
.LBB0_9:                                //   in Loop: Header=BB0_5 Depth=1
        mov     w10, wzr
        cbz     w10, .LBB0_5
.LBB0_10:
        mov     x0, x8
        ret

See
https://stackoverflow.com/questions/51346815/atomically-clearing-lowest-non-zero-bit-of-an-unsigned-integer/51353947#51353947
for a more detailed commentary of the version you get from  
while(unlikely(!cas(...))) {}.

This hand-written version is what we *should* emit, if we still fail to
optimize a compare_exchange_weak loop into a LL/SC loop (if that's even legal).
 *This* version is definitely legal; it makes all the same memory accesses as
the source, and as clang's current code, but without being ridiculous.

    ldr     x9, [x0]                   @ x9 = expected = foo.load(relaxed)
 .Lcas_retry:
    ldaxr   x8, [x0]                   @ x8 = the CAS load (a=acquire,
x=exclusive: pairs with a later stxr)
    cmp     x8, x9                   @ compare part of the CAS
    b.ne    .Lcas_fail
    add     x10, x9, #2
    stlxr   w11, x10, [x0]           @ the CAS store (sequential-release)
    cbnz    w11, .Lllsc_fail

    # LL/SC success falls through
#.Lepilogue:
    mov     x0, x8
    ret

.Lcas_fail:
    clrex                           @ clear the ldrx/stxr transaction
.Lllsc_fail:
    mov     x9, x8                  @ update expected
    b     .Lcas_retry           @ instead of duplicating the loop, jump back to
the existing one with x9 = expected

No taken branches in the fast path (the no-contention case)

Instead of peeling the first iteration of the while() loop, I just jump over
the function epilogue.  LL/SC fail is slow anyway, and unlikely, so putting the
setup for the next iteration out-of-line makes a lot of sense.  (Especially if
the source uses an unlikely() __builtin_expect() macro.

--------

x86: gcc does this, which looks optimal:

CAS_retry(std::atomic<unsigned long>&):
        mov     rax, QWORD PTR [rdi]
.L2:
        lea     rdx, [rax+2]
        lock cmpxchg    QWORD PTR [rdi], rdx
        jne     .L2
        ret

clang does this, with a bunch of extra MOV instructions, and peeling the loop
just requires a taken branch on the fast path.

CAS_retry(std::atomic<unsigned long>&): 
        mov     rcx, qword ptr [rdi]
        lea     rdx, [rcx + 2]
        mov     rax, rcx
        lock            cmpxchg qword ptr [rdi], rdx
        je      .LBB0_3
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     rcx, rax         #### This is especially suspect
        lea     rdx, [rcx + 2]   # this could have used rax as a source
        mov     rax, rcx         # rax = rcx from the previous wasted mov
        lock            cmpxchg qword ptr [rdi], rdx
        jne     .LBB0_1
.LBB0_3:
        mov     rax, rcx       # cmpxchg leaves expected in RAX on success
        ret

I don't think there's likely to be a branch-prediction performance advantage
from peeling the first iteration.  I guess there could be in the general case
of other loops whose iteration counts don't depend on thread timing, though.  A
loop-or-not branch separate from the main loop branch might predict better for
patterns of 0 iterations vs. 30 iterations where there's some pattern to when
it should fall through the first time.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20180716/6fe735a7/attachment.html>