[llvm-bugs] [Bug 34707] New: unnecessary 8-bit partial-register usage creates false dependencies.

via llvm-bugs llvm-bugs at lists.llvm.org
Fri Sep 22 11:22:20 PDT 2017


https://bugs.llvm.org/show_bug.cgi?id=34707

            Bug ID: 34707
           Summary: unnecessary 8-bit partial-register usage creates false
                    dependencies.
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

unsigned long bzhi_l(unsigned long x, unsigned c) {
    return x & ((1UL << c) - 1);
}
// https://godbolt.org/g/sBEyfd
clang 6.0.0 (trunk 313965) -xc -O3 -march=haswell -m32    or znver1

        movb    8(%esp), %al
        bzhil   %eax, 4(%esp), %eax
        retl

This is technically correct (because BZHI only looks at the low 8 bits of
src2), but horrible.  There is *no* advantage to using an 8-bit load here
instead of a 32-bit load: same code size, but it creates a false dependency on
the old value of %eax.

(znver1 definitely doesn't rename partial registers.  Intel Haswell/Skylake
don't rename low8 registers separately from the full register, unlike
Sandybridge or Core2/Nehalem. 
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to).

On Haswell and Skylake, movb  8(%esp), %al  runs at 1 per cycle, as a
micro-fused ALU+load uop.  An occasional dep-breaking xor %eax,%eax lets it
bottleneck on 2 loads per clock.

Clang seems very eager to move only 8 bits instead of the full register. 
Clang 3.9 fixed this for reg-reg moves (e.g. unsigned shift(unsigned x,
unsigned c) {  return x<<c; }   without BMI2), but we're still getting 8-bit
loads.  On Intel CPUs, MOVZX loads are cheaper than narrow MOV loads because
they avoid the ALU uop that merges into the destination.  (It does cost an
extra code byte.)  AMD CPUs may use an ALU port for MOVZX, but Intel handles
it purely in the load ports.
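For comparison, the two load forms side by side (my annotations, summarizing
the behavior described above for Haswell/Skylake):

        movb    8(%esp), %al     # load + merge into %eax: false dep on old %eax
        movzbl  8(%esp), %eax    # pure load on Intel, writes full %eax: no false dep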

But anyway, when loading from a 32-bit memory location, it makes no sense to
load only the low 8 bits, unless we have reason to expect it was written with
separate byte stores and we want to avoid a store-forwarding stall.
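In this case the shift count lives in a full 32-bit stack slot, so the code
we'd want is simply a 32-bit load (sketch of the expected output, same code
size as the movb version):

        movl    8(%esp), %eax
        bzhil   %eax, 4(%esp), %eax
        retl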
