[llvm-bugs] [Bug 35047] New: load merging for (data[0]<<0) | (data[1]<<8) | ... endian agnostic load goes berserk with AVX2 variable-shift

via llvm-bugs llvm-bugs at lists.llvm.org
Mon Oct 23 17:10:56 PDT 2017


https://bugs.llvm.org/show_bug.cgi?id=35047

            Bug ID: 35047
           Summary: load merging for (data[0]<<0) | (data[1]<<8) | ...
                    endian agnostic load goes berserk with AVX2
                    variable-shift
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: performance
          Severity: normal
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

unsigned load_le32(unsigned char *data) {
    unsigned le32 = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
    return le32;
}

// https://godbolt.org/g/X8i1pr
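
For comparison, here's a memcpy-based sketch of the same 4-byte load (illustration only, not part of the original testcase; the function name is arbitrary, and it's only equivalent to load_le32 on a little-endian target).  clang should lower the small fixed-size memcpy to a single 32-bit load regardless of -march:

#include <string.h>

unsigned load_le32_memcpy(unsigned char *data) {
    unsigned le32;
    memcpy(&le32, data, sizeof(le32));   // lowered to one 32-bit load (movl on x86)
    return le32;
}

That single movl is also what the byte-OR idiom above gets when AVX2 is disabled: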

clang 6.0.0 (trunk 316311) -O3 -march=haswell -mno-avx

        movl    (%rdi), %eax
        retq

-O3 -march=haswell (with AVX2)

.LCPI0_0:
        .quad   16                      # 0x10
        .quad   24                      # 0x18
load_le32:                              # @load_le32
        movzbl  (%rdi), %eax
        movzbl  1(%rdi), %ecx
        shll    $8, %ecx
        vpmovzxbq       2(%rdi), %xmm0  # xmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        orl     %eax, %ecx
        vpsllvq .LCPI0_0(%rip), %xmm0, %xmm0
        vmovd   %xmm0, %edx
        vpextrd $2, %xmm0, %eax
        orl     %edx, %eax
        orl     %ecx, %eax
        retq

So if vpsllvq is available, clang uses it and doesn't notice that it could have
coalesced the four byte loads into a single 32-bit load.  -fno-vectorize doesn't
block this.  (And if the shift counts didn't line up this way, it's quite poorly
vectorized.  VPMOVZXBD would have worked: 4 shifts, then a horizontal reduction
with OR using the same shuffle pattern as a horizontal sum, e.g. vpunpckhqdq /
vpor / vmovq / rorx $32, %rax, %rdx / or %edx, %eax.)
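
To make that pattern concrete, here's a rough intrinsics sketch of the VPMOVZXBD + variable-shift + horizontal-OR idea (my illustration, not compiler output; the function name and the plain 4-byte load feeding the zero-extend are arbitrary choices):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Requires AVX2 for _mm_sllv_epi32.  A single 32-bit load is still the right
// answer for load_le32; this only sketches the generic shift/OR reduction.
unsigned load_le32_vec_sketch(unsigned char *data) {
    uint32_t raw;
    memcpy(&raw, data, 4);                               // 4 contiguous bytes
    __m128i bytes   = _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)raw)); // vpmovzxbd
    __m128i counts  = _mm_set_epi32(24, 16, 8, 0);       // per-element shift counts
    __m128i shifted = _mm_sllv_epi32(bytes, counts);     // vpsllvd
    // horizontal OR reduction, same shuffle pattern as a horizontal sum
    __m128i hi  = _mm_unpackhi_epi64(shifted, shifted);  // vpunpckhqdq
    __m128i red = _mm_or_si128(shifted, hi);             // vpor
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(red);      // vmovq
    return (unsigned)(lo | (lo >> 32));                  // final 32-bit OR
}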

(And BTW, for Haswell and later,  movb 1(%rdi), %al  merges into RAX without
stalling at all.  It's a single micro-fused load+merge uop, so it's better than
a separate movzx load + OR instruction.  See  
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to)


clang 4.0.1 doesn't merge the loads.
