[llvm-bugs] [Bug 33109] New: x86 -m32 atomic<int64_t> stores/loads should use SSE

via llvm-bugs llvm-bugs at lists.llvm.org
Fri May 19 18:48:40 PDT 2017


            Bug ID: 33109
           Summary: x86 -m32 atomic<int64_t> stores/loads should use SSE
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

Since Pentium (P5), 64-bit loads/stores are atomic, so 32-bit code can use x87,
MMX, or SSE to take advantage of this (except with -march=i486 or older).  Such
loads/stores still have the usual acquire or release semantics.

For P6 and newer, this applies to any 64-bit or narrower store that doesn't
cross a cache-line boundary, regardless of alignment.  For Pentium P5, this
applies to naturally-aligned 64-bit loads/stores.  (See 
http://stackoverflow.com/a/36685056/224132 for extracts of the relevant
sections of Intel's x86 manual).
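As a minimal sketch of the pattern in question (hypothetical helper names, just illustrating the source the report is about):

```cpp
#include <atomic>
#include <cstdint>

// On -m32, each of these could lower to a single movq through an XMM
// (or x87/MMX) register, since P5 and later guarantee that aligned
// 64-bit loads/stores are atomic -- instead of a lock cmpxchg8b loop.
int64_t load64_acquire(std::atomic<int64_t> *p) {
    return p->load(std::memory_order_acquire);
}

void store64_release(std::atomic<int64_t> *p, int64_t v) {
    p->store(v, std::memory_order_release);
}
```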

gcc has been doing this for years, so we can be pretty confident it's safe.
(But gcc uses pretty horrible code for transferring data from integer to xmm
registers, creating a store-forwarding stall instead of using movd/pinsrd or
some non-SSE4 equivalent; see the gcc bug linked below.)

IDK if anyone cares at all about generating obsolete 32-bit x86 code, but
avoiding cmpxchg8b is such a big gain that it might be worth doing.

(Also, note that using cmpxchg8b as a load isn't even safe for data in
read-only memory.  gcc7 no longer inlines cmpxchg16b, even with -mcx16, so
16-byte objects can maybe be handled differently in the future by changing
libatomic.  And is_lock_free reports false even though the current
implementation is still lock-free; it's just not fast.)
Anyway, for example,

int64_t load64(std::atomic<int64_t> *p) {
  return p->load(std::memory_order_acquire) + 1;
}
currently compiles to (clang -O3 -march=nehalem):

        pushl   %ebx
        pushl   %esi
        movl    12(%esp), %esi
        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        xorl    %ebx, %ebx
        lock            cmpxchg8b       (%esi)
        addl    $1, %eax
        adcl    $0, %edx
        popl    %esi
        popl    %ebx
        retl

First, even for i486 with cmpxchg8b, it could avoid save/restore of esi:  As
long as we keep new = expected so the CAS doesn't change the value if it does
happen to match, we can put the address in one of the implicitly-used
registers.  So for example load `p` into edx and use mov %edx, %ecx instead of
xor-zeroing ecx.


But what we really want for normal 32-bit code is:

        movl    4(%esp), %eax
        movq    (%eax), %xmm0  # 64-bit atomic load, with acquire semantics

        pcmpeqd %xmm1, %xmm1
        psubq   %xmm1, %xmm0   # xmm0 -= -1: we can generate the -1
                               # constant with 1 ALU uop, while +1 would
                               # also take a psrlq $63, %xmm1.

        movd    %xmm0, %eax
        pextrd  $1, %xmm0, %edx

Or without SSE4 for PEXTRD, probably a PSRLDQ by 4 bytes, or a PSHUFLW to
copy+shuffle if we want to avoid destroying the int64_t in an xmm reg while
extracting to integer.  (PSHUFLW is faster than PSHUFD on K8 and Intel
pre-Conroe, and identical performance on all more recent CPUs).
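The whole desired lowering (movq load, the pcmpeqd/psubq +1 trick, then SSE2-only extraction with movd + psrldq) can be sketched with intrinsics; the helper name is hypothetical and this is just one possible non-SSE4 instruction selection, not the only valid one:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Sketch of the proposed codegen, assuming a naturally-aligned object:
// one atomic 64-bit movq load, +1 done in the vector domain, and the
// two 32-bit halves extracted without PEXTRD.
int64_t load64_plus1_sse2(const int64_t *p) {
    __m128i x  = _mm_loadl_epi64((const __m128i *)p);  // movq: atomic 64-bit load
    __m128i m1 = _mm_cmpeq_epi32(x, x);                // pcmpeqd: all-ones == -1
    x = _mm_sub_epi64(x, m1);                          // psubq: x -= -1, i.e. x += 1
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(x);      // movd -> eax
    uint32_t hi = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(x, 4)); // psrldq $4; movd -> edx
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

On a 64-bit build this all folds into plain integer code, of course; the point is what it would select for -m32.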

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for more about different
possible strategies for xmm->int and int->xmm, and about the benefits of doing
64-bit integer ops in vector regs, especially when we want the data there
anyway.
Note that ADC is 2 uops on Intel pre-Broadwell, so add/adc is worth avoiding. 
Variable-count integer shifts (without BMI2) are also 3 uops on SnB-family.
