[llvm-bugs] [Bug 40685] New: [5, 6, 7, 8, 9 regression] auto-vectorization unpacks, repacks, and unpacks to 32-bit again for count += (bool_arr[i]==0) for boolean array, using 3x the shuffles needed

via llvm-bugs llvm-bugs at lists.llvm.org
Sun Feb 10 22:51:25 PST 2019


https://bugs.llvm.org/show_bug.cgi?id=40685

            Bug ID: 40685
           Summary: [5,6,7,8,9 regression] auto-vectorization unpacks,
                    repacks, and unpacks to 32-bit again for count +=
                    (bool_arr[i]==0) for boolean array, using 3x the
                    shuffles needed
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: performance, regression
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: htmldeveloper at gmail.com, llvm-bugs at lists.llvm.org

int count(const bool *visited, int len) {
    int counter = 0;

    for (int i = 0; i < 100; i++) {   // whether the loop uses len or a constant doesn't matter
        if (visited[i] == 0)
            counter++;
    }
    return counter;
}

(adapted from:
https://stackoverflow.com/questions/54618685/what-is-the-meaning-use-of-the-movzx-cdqe-instructions-in-this-code-output-by-a)

I expected compilers not to notice that the byte elements can't overflow (and
so to emit code that unpacks to dword inside the loop), and probably to fail to
use psadbw to hsum bytes inside the loop.  (ICC does unpack inside the loop;
gcc and MSVC just go scalar.)
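
For reference, something like this untested intrinsics sketch is what I mean by
using psadbw to hsum bytes inside the loop, so nothing ever has to be widened
to dword.  (The function name is made up, and it assumes len is a multiple of
16 and that bool is stored as 0/1 bytes.)

#include <immintrin.h>

int count_psadbw(const bool *visited, int len) {    // hypothetical name
    const __m128i zero = _mm_setzero_si128();
    const __m128i one  = _mm_set1_epi8(1);
    __m128i acc = _mm_setzero_si128();               // two qword counters
    for (int i = 0; i < len; i += 16) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(visited + i));
        __m128i eq   = _mm_cmpeq_epi8(v, zero);      // 0xFF where visited[i]==0
        __m128i hits = _mm_and_si128(eq, one);       // 1 per zero byte
        acc = _mm_add_epi64(acc, _mm_sad_epu8(hits, zero));      // psadbw: hsum 8 bytes -> qword
    }
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));      // combine the two qword halves
    return _mm_cvtsi128_si32(acc);
}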

But I didn't expect clang to pack back down to bytes with pshufb after the
PXOR, before redoing the expansion to dword with another PMOVZX.  (This is a
regression from clang 4.0.1.)

https://godbolt.org/z/1SEmTu

# clang version 9.0.0 (trunk 353629) on Godbolt
# -O3 -Wall -march=haswell -fno-unroll-loops -mno-avx
count(bool const*, int):
        pxor    xmm0, xmm0
        xor     eax, eax
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = [1,1,1,1]
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_1] # xmm2 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        pmovzxbd        xmm3, dword ptr [rdi + rax] # xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
        pxor    xmm3, xmm1
        pshufb  xmm3, xmm2
        pmovzxbd        xmm3, xmm3      # xmm3 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero,xmm3[2],zero,zero,zero,xmm3[3],zero,zero,zero
        paddd   xmm0, xmm3

        add     rax, 4
        cmp     rax, 100
        jne     .LBB0_1
       ... horizontal sum

Unrolling just repeats this pattern.  -march=haswell -mno-avx is basically the
same.  -march=haswell *with* AVX2 does slightly better, only unpacking to
16-bit elements in an XMM before repacking; otherwise it would have needed a
lane-crossing byte shuffle to pack back to bytes for vpmovzxbd ymm, xmm.

So it looks like something really wants to fill up a whole XMM before flipping
bits with PXOR, instead of just flipping packed bits in an XMM with high
garbage.  If you're going to unpack anyway, you might as well just flip the
unpacked booleans so you can still load with pmovzx.  movd + pxor would be
worse, especially on CPUs other than Intel SnB-family, where an indexed
addressing mode for pmovzx saves front-end bandwidth vs. a separate load.

The pshufb + 2nd pmovzxbd can literally be removed with zero change to the
result, because xmm1 = set1_epi32(1): after the first pmovzxbd each dword
element is 0 or 1, XOR with 1 keeps it 0 or 1, so packing back down to bytes
and zero-extending again is a no-op.

        pmovzxbd  xmm3, dword ptr [rdi + rax]    ; un-laminates on SnB including HSW/SKL
        pxor      xmm3, xmm1
        paddd     xmm0, xmm3
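
In intrinsics, the whole inner loop only needs something like this (untested
sketch; the name is made up, and whether the compiler keeps the 4-byte load
folded into pmovzxbd's memory operand is a separate question):

#include <immintrin.h>
#include <string.h>

int count_xor(const bool *visited) {                 // hypothetical name
    __m128i counter = _mm_setzero_si128();
    const __m128i one = _mm_set1_epi32(1);
    for (int i = 0; i < 100; i += 4) {
        int chunk;
        memcpy(&chunk, visited + i, 4);              // 4 bool bytes
        __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(chunk));  // pmovzxbd
        counter = _mm_add_epi32(counter, _mm_xor_si128(v, one));  // dword ^ 1 == (b==0)
    }
    // horizontal sum of the 4 dword counters
    counter = _mm_add_epi32(counter, _mm_shuffle_epi32(counter, _MM_SHUFFLE(1, 0, 3, 2)));
    counter = _mm_add_epi32(counter, _mm_shuffle_epi32(counter, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(counter);
}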

Of course, avoiding an indexed addressing mode would also be a good thing when
tuning for Haswell.  Clang/LLVM still use an indexed one for -march=haswell,
costing an extra uop from un-lamination.  (pmovzx's destination is write-only,
so it always un-laminates an indexed addressing mode.  vpmovzx can't micro-fuse
a load with a ymm destination at all, but it can with an xmm destination.)


We could also consider unpacking against zero with punpcklbw/punpckhbw to feed
2x punpcklwd/punpckhwd, but that saves PXOR instructions and load uops at the
cost of more shuffle uops (6 instead of 4 to get 4 dword vectors from 16
bytes).
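
Untested sketch of that punpck alternative (name made up; assumes len is a
multiple of 16):

#include <immintrin.h>

int count_punpck(const bool *visited, int len) {      // hypothetical name
    const __m128i zero = _mm_setzero_si128();
    __m128i c0 = zero, c1 = zero, c2 = zero, c3 = zero;
    for (int i = 0; i < len; i += 16) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(visited + i));
        __m128i flip = _mm_xor_si128(v, _mm_set1_epi8(1));    // (b==0) as packed bytes
        __m128i lo   = _mm_unpacklo_epi8(flip, zero);         // punpcklbw: 8 words
        __m128i hi   = _mm_unpackhi_epi8(flip, zero);         // punpckhbw: 8 words
        c0 = _mm_add_epi32(c0, _mm_unpacklo_epi16(lo, zero)); // punpcklwd: 4 dwords
        c1 = _mm_add_epi32(c1, _mm_unpackhi_epi16(lo, zero));
        c2 = _mm_add_epi32(c2, _mm_unpacklo_epi16(hi, zero));
        c3 = _mm_add_epi32(c3, _mm_unpackhi_epi16(hi, zero));
    }
    __m128i sum = _mm_add_epi32(_mm_add_epi32(c0, c1), _mm_add_epi32(c2, c3));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(sum);
}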

--------------

This changed between clang 4.0.1 and clang 5.0:

    # clang4.0.1 inner loop

        pmovzxbd        xmm3, dword ptr [rdi + rax] # xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
        pxor    xmm3, xmm1         # ^= set1_epi32(1)
        pand    xmm3, xmm2         # &= set1_epi32(255)
        paddd   xmm0, xmm3


This is less bad: only 1 shuffle uop per vector instead of the 3 current trunk
uses, so trunk has 3x the shuffle-port bottleneck on Haswell/Skylake.  So this
is a regression.

----

## Other missed optimizations

Reporting separately; I'll link the bug number here for LLVM's failure to
efficiently sum 8-bit elements with PSADBW, and so on.
