[llvm-bugs] [Bug 34871] New: AVX512: vpmovwb is 2 uops for only a 256b result, and thus slower than vpackuswb + vpermq
via llvm-bugs
llvm-bugs at lists.llvm.org
Sat Oct 7 13:35:19 PDT 2017
https://bugs.llvm.org/show_bug.cgi?id=34871
Bug ID: 34871
Summary: AVX512: vpmovwb is 2 uops for only a 256b result, and
thus slower than vpackuswb + vpermq
Product: new-bugs
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Keywords: performance
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: peter at cordes.ca
CC: llvm-bugs at lists.llvm.org
// gcc and clang both auto-vectorize this sub-optimally.
#include <stdint.h>
#include <stddef.h>

void pack_high8_baseline(uint8_t *__restrict__ dst,
                         const uint16_t *__restrict__ src, size_t bytes) {
    uint8_t *end_dst = dst + bytes;
    do {
        *dst++ = *src++ >> 8;
    } while (dst < end_dst);
}
https://godbolt.org/g/R639cg
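In case it helps anyone test alternative implementations against the scalar
loop, here's a throwaway correctness harness (the buffer size and test pattern
are arbitrary choices of mine, not part of the testcase):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

void pack_high8_baseline(uint8_t *__restrict__ dst,
                         const uint16_t *__restrict__ src, size_t bytes);

int main(void) {
    enum { N = 4096 };                 // arbitrary, multiple of 64
    static uint16_t src[N];
    static uint8_t dst[N];
    for (int i = 0; i < N; i++)
        src[i] = (uint16_t)(i * 257);  // pattern with varying high bytes
    pack_high8_baseline(dst, src, N);
    for (int i = 0; i < N; i++) {
        if (dst[i] != (uint8_t)(src[i] >> 8)) {
            printf("mismatch at %d\n", i);
            return 1;
        }
    }
    puts("ok");
    return 0;
}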
clang 6.0.0 (trunk 314968) -O3 -march=skylake-avx512 -mavx512vbmi:
.LBB2_3:                               # =>This Inner Loop Header: Depth=1
        vpsrlw   $8, (%rsi,%rax,2), %zmm0
        vpsrlw   $8, 64(%rsi,%rax,2), %zmm1
        vpsrlw   $8, 128(%rsi,%rax,2), %zmm2
        vpsrlw   $8, 192(%rsi,%rax,2), %zmm3
        vpmovwb  %zmm0, (%rdi,%rax)
        vpmovwb  %zmm1, 32(%rdi,%rax)
        vpmovwb  %zmm2, 64(%rdi,%rax)
        vpmovwb  %zmm3, 96(%rdi,%rax)
        subq     $-128, %rax
        cmpq     %rax, %r9
        jne      .LBB2_3
This looks ok, but it turns out to suck because vpmovwb is 2 ALU uops for port
5. So this produces one 256b result per 2 shuffle uops: half of what we can do
with vpackuswb / vpermq.
Also, indexed addressing modes can't stay micro-fused with AVX instructions on
Skylake; only destructive 2-operand instructions (like `paddw (%rsi,%rcx),
%xmm0`) keep them fused. vpsrlw $8, 64(%rsi,%rcx,2), %zmm1 will almost
certainly un-laminate.
(https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes#comment76198723_31027695)
(IACA says vpsrlw will micro-fuse with a non-indexed addressing mode, even
though the instruction also uses an immediate operand, which stops micro-fusion
for some vector instructions.) With more efficient shuffles, this could lead
to a front-end bottleneck.
--------
We can produce one 512b result per 2 shuffle uops like this, in theory getting
twice the throughput of clang's version:

.Lloop:
        vpsrlw     $8, 0(%rsi), %zmm0
        vpsrlw     $8, 64(%rsi), %zmm1
        vpackuswb  %zmm1, %zmm0, %zmm0    # 1 uop for a 2-input shuffle
        vpermq     %zmm0, %zmm7, %zmm0    # lane-crossing fixup for vpackuswb
                                          # (qword shuffle control preloaded in zmm7)
        vmovdqu64  %zmm0, (%rdi,%rdx)
        add        $(2*64), %rsi
        add        $64, %rdx              # counts up towards zero
        jnc        .Lloop
This should be 7 fused-domain uops, and run at one 512b store per 2 clocks
(port0 saturated with shifts, port5 saturated with shuffles, and p2/p3 each
handling 1.5 uops per 2 clocks since the indexed store can't use p7). The
indexed store costs us that p7 opportunity, but saves a CMP instruction. With
unrolling, pointer increments would be the way to go.
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 for more perf-analysis
of this.
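For reference, here's an untested C intrinsics sketch of the vpackuswb /
vpermq idea (the function name and the assumption that `bytes` is a multiple
of 64 with no cleanup loop are mine, not part of the testcase; compile with
-mavx512bw):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

// One 512b result per 2 shuffle uops: in-lane pack, then a qword fixup.
void pack_high8_pack_permq(uint8_t *__restrict__ dst,
                           const uint16_t *__restrict__ src, size_t bytes) {
    // qword indices to undo vpackuswb's in-lane interleaving: the four
    // lanes of the first input, then the four lanes of the second.
    const __m512i fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
    const uint8_t *end_dst = dst + bytes;
    do {
        __m512i lo = _mm512_srli_epi16(_mm512_loadu_si512(src), 8);
        __m512i hi = _mm512_srli_epi16(_mm512_loadu_si512(src + 32), 8);
        __m512i packed = _mm512_packus_epi16(lo, hi);      // vpackuswb: exact here,
                                                           // inputs are already 0..255
        packed = _mm512_permutexvar_epi64(fixup, packed);  // vpermq lane fixup
        _mm512_storeu_si512(dst, packed);
        src += 64;                                         // 128 bytes of input
        dst += 64;                                         // 64 bytes of output
    } while (dst < end_dst);
}

This should compile to more or less the loop above; whether the compiler keeps
the non-indexed addressing modes is up to it.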
In general, AVX512 lane-crossing shuffles with an element size smaller than
32-bit are multiple uops for port5. See
https://github.com/InstLatx64/InstLatx64/blob/master/AVX512_SKX_PortAssign_v102_PUB.ods
for SKX throughput/latency from AIDA64 and from Intel, and also uop->port
assignments from IACA (Intel's static-analysis tool, not perf counters on real
hardware). But vpmovwb is measured at one per 2 clocks throughput on real
hardware.
IACA is wrong about store micro-fusion: indexed stores can stay micro-fused (no
un-lamination) on HSW and SKX if they fuse in the first place. (vextractf128
stores don't, but regular vmov stores do. Presumably vmovdqu64 does too, but
vmovdqu8 stores always need an ALU uop (according to Intel's docs), even
without masking.)
----------
Also, with AVX512VBMI, vpermt2b is probably a good idea.

        # shuffle control in zmm1
.Loop:
        vmovdqa64  (%rsi,%rax,2), %zmm0
        vpermt2b   64(%rsi,%rax,2), %zmm1, %zmm0
        vmovdqu64  %zmm0, (%rcx,%rax)
        addq       $64, %rax
        cmpq       %rax, %rdi     # could be avoided by indexing from the end of the array
        jne        .Loop
Assuming that Cannonlake will implement vpermt2b as p0 + 2p5 the way SKX
implements vpermt2w, this will also run at one 512b store per 2 clocks.
It's one fewer port 0 uop, so only port5 is saturated. vpermt2b probably can't
micro-fuse at all, so an indexed addressing mode is fine.
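An equally untested intrinsics sketch of the vpermt2b version (again, the
function name, the control-vector setup, and the multiple-of-64 size
assumption are mine; needs -mavx512vbmi):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

// One 2-source byte shuffle per 64 output bytes: output byte i = byte 2*i+1
// of the 128-byte concatenation of the two input vectors.
void pack_high8_permt2b(uint8_t *__restrict__ dst,
                        const uint16_t *__restrict__ src, size_t bytes) {
    uint8_t idx[64];
    for (int i = 0; i < 64; i++)
        idx[i] = (uint8_t)(2 * i + 1);  // 0..63 select from the first source,
                                        // 64..127 from the second
    const __m512i control = _mm512_loadu_si512(idx);
    const uint8_t *end_dst = dst + bytes;
    do {
        __m512i lo = _mm512_loadu_si512(src);
        __m512i hi = _mm512_loadu_si512(src + 32);
        _mm512_storeu_si512(dst, _mm512_permutex2var_epi8(lo, control, hi));
        src += 64;
        dst += 64;
    } while (dst < end_dst);
}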