[llvm-bugs] [Bug 40685] New: [5, 6, 7, 8, 9 regression] auto-vectorization unpacks, repacks, and unpacks to 32-bit again for count += (bool_arr[i]==0) for boolean array, using 3x the shuffles needed
via llvm-bugs
llvm-bugs at lists.llvm.org
Sun Feb 10 22:51:25 PST 2019
https://bugs.llvm.org/show_bug.cgi?id=40685
Bug ID: 40685
Summary: [5,6,7,8,9 regression] auto-vectorization unpacks,
repacks, and unpacks to 32-bit again for count +=
(bool_arr[i]==0) for boolean array, using 3x the
shuffles needed
Product: new-bugs
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Keywords: performance, regression
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: peter at cordes.ca
CC: htmldeveloper at gmail.com, llvm-bugs at lists.llvm.org
int count(const bool *visited, int len) {
    int counter = 0;
    for(int i=0;i<100;i++) {      // len unused or not doesn't matter
        if (visited[i]==0)
            counter++;
    }
    return counter;
}
(adapted from:
https://stackoverflow.com/questions/54618685/what-is-the-meaning-use-of-the-movzx-cdqe-instructions-in-this-code-output-by-a)
I expected compilers not to notice that byte elements wouldn't overflow (and
therefore to make code that unpacks to dword inside the loop), and probably to
fail to use psadbw to hsum bytes inside a loop. (ICC does exactly that; gcc
and MSVC just go scalar.)
But I didn't expect clang to pack back down to bytes with pshufb after the
PXOR, before redoing the expansion to dword with another PMOVZX. (This is a
regression from clang 4.0.1.)
https://godbolt.org/z/1SEmTu
# clang version 9.0.0 (trunk 353629) on Godbolt
# -O3 -Wall -march=haswell -fno-unroll-loops -mno-avx
count(bool const*, int):
        pxor     xmm0, xmm0
        xor      eax, eax
        movdqa   xmm1, xmmword ptr [rip + .LCPI0_0]   # xmm1 = [1,1,1,1]
        movdqa   xmm2, xmmword ptr [rip + .LCPI0_1]   # xmm2 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        pmovzxbd xmm3, dword ptr [rdi + rax]          # xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
        pxor     xmm3, xmm1
        pshufb   xmm3, xmm2
        pmovzxbd xmm3, xmm3                           # xmm3 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero,xmm3[2],zero,zero,zero,xmm3[3],zero,zero,zero
        paddd    xmm0, xmm3
        add      rax, 4
        cmp      rax, 100
        jne      .LBB0_1
        ... horizontal sum
Unrolling just repeats this pattern
-march=haswell -mno-avx is basically the same. -march=haswell *with* AVX2 does
slightly better, only unpacking to 16-bit elements in an XMM before repacking,
otherwise it would have needed a lane-crossing byte shuffle to pack back to
bytes for vpmovzxbd ymm, xmm.
So it looks like something really wants to fill a whole XMM with valid data
before flipping bits with PXOR, instead of just flipping the packed bools in
an XMM whose upper bytes are garbage. If you're going to unpack anyway, you
might as well flip the unpacked booleans so you can still load with pmovzx.
movd + pxor before unpacking would be worse, especially on CPUs other than
Intel SnB-family, where an indexed addressing mode for pmovzx saves front-end
bandwidth vs. a separate load.
The pshufb + 2nd pmovzxbd can literally be removed with zero change to the
result, because xmm1 = set1_epi32(1).
        pmovzxbd xmm3, dword ptr [rdi + rax]   ; un-laminates on SnB including HSW/SKL
        pxor     xmm3, xmm1
        paddd    xmm0, xmm3
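In C intrinsics terms, that loop is roughly the following sketch (the name
count_intrin and the memcpy 4-byte load are just illustrative; memcpy is only
a strict-aliasing-safe way to express the 4-byte pmovzxbd source):

#include <smmintrin.h>   /* SSE4.1 for _mm_cvtepu8_epi32 (pmovzxbd) */
#include <stdbool.h>     /* no-op in C++, needed if built as C */
#include <string.h>

/* Hypothetical intrinsics version of the 3-instruction loop body above:
 * pmovzxbd load, pxor with set1_epi32(1), paddd.  Uses the same fixed
 * trip count of 100 (a multiple of 4) as the test case. */
int count_intrin(const bool *visited)
{
    const __m128i one = _mm_set1_epi32(1);
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < 100; i += 4) {
        int chunk;
        memcpy(&chunk, visited + i, 4);                           /* 4 bools  */
        __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(chunk));  /* pmovzxbd */
        v   = _mm_xor_si128(v, one);                              /* 0/1 -> 1/0: pxor */
        acc = _mm_add_epi32(acc, v);                              /* paddd    */
    }
    /* horizontal sum of the 4 dword counters */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
}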
Of course, using a non-indexed addressing mode would also be a good thing when
tuning for Haswell. Clang/LLVM still uses an indexed addressing mode for
-march=haswell, costing an extra uop from un-lamination. (pmovzx's destination
is write-only, so an indexed addressing mode always un-laminates; vpmovzx
can't micro-fuse with a ymm destination, but it can with an xmm destination.)
We could also consider unpacking against zero with punpcklbw / punpckhbw to
feed 2x punpcklwd / punpckhwd; that saves PXOR instructions and load uops at
the cost of more shuffle uops (6 instead of 4 to get 4 dword vectors).
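For comparison, a sketch of that per-16-byte punpck step (accum16 is a made-up
helper name; the caller would still need a dword horizontal sum at the end):

#include <emmintrin.h>   /* SSE2 */
#include <stdbool.h>     /* no-op in C++, needed if built as C */

/* Hypothetical per-16-byte step for the punpck variant: one load and one
 * pxor flip all 16 bools at once, then 6 unpacks against zero
 * (punpckl/hbw + punpckl/hwd) widen them to 4 dword vectors for paddd. */
static inline __m128i accum16(__m128i acc, const bool *p)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    v = _mm_xor_si128(v, _mm_set1_epi8(1));               /* 0/1 -> 1/0 */
    __m128i lo = _mm_unpacklo_epi8(v, zero);              /* 8x u16 */
    __m128i hi = _mm_unpackhi_epi8(v, zero);              /* 8x u16 */
    acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(lo, zero));
    acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(lo, zero));
    acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(hi, zero));
    acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(hi, zero));
    return acc;                                           /* 4 dword counters */
}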
--------------
This changed between Clang 4.0.1 and clang 5.0:
# clang 4.0.1 inner loop
        pmovzxbd xmm3, dword ptr [rdi + rax]   # xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
        pxor     xmm3, xmm1                    # ^= set1_epi32(1)
        pand     xmm3, xmm2                    # &= set1_epi32(255)
        paddd    xmm0, xmm3
This is less bad: the current code runs 3x as many shuffle uops per input
vector, a shuffle-port bottleneck on Haswell/Skylake, so this is a regression.
----
## Other missed optimizations
Reporting separately; I'll link the bug number here for LLVM's failure to
efficiently sum 8-bit elements with PSADBW and so on.
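For reference, the kind of byte-granularity loop I mean would look something
like this sketch (count_zeros_psadbw is a made-up name, it assumes len is a
multiple of 16 and an x86-64 target, and the 255-iteration inner block just
keeps the byte counters from overflowing before the psadbw horizontal sum):

#include <emmintrin.h>   /* SSE2 */
#include <stdbool.h>     /* no-op in C++, needed if built as C */
#include <stdint.h>

int count_zeros_psadbw(const bool *visited, int len)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i total = _mm_setzero_si128();        /* two 64-bit partial sums */
    int i = 0;
    while (i < len) {
        __m128i byte_acc = _mm_setzero_si128(); /* per-byte counts, kept < 256 */
        int block_end = i + 255 * 16;
        if (block_end > len) block_end = len;
        for (; i < block_end; i += 16) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(visited + i));
            __m128i eq = _mm_cmpeq_epi8(v, zero);    /* 0 -> 0xFF (-1)     */
            byte_acc   = _mm_sub_epi8(byte_acc, eq); /* x -= -1  ==  x += 1 */
        }
        /* psadbw: horizontal-sum 16 bytes into two 64-bit halves */
        total = _mm_add_epi64(total, _mm_sad_epu8(byte_acc, zero));
    }
    return (int)(_mm_cvtsi128_si64(total) +
                 _mm_cvtsi128_si64(_mm_unpackhi_epi64(total, total)));
}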