[llvm-bugs] [Bug 34871] New: AVX512: vpmovwb is 2 uops for only a 256b result, and thus slower than vpackuswb + vpermq

via llvm-bugs llvm-bugs at lists.llvm.org
Sat Oct 7 13:35:19 PDT 2017


https://bugs.llvm.org/show_bug.cgi?id=34871

            Bug ID: 34871
           Summary: AVX512: vpmovwb is 2 uops for only a 256b result, and
                    thus slower than vpackuswb + vpermq
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: performance
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

// gcc and clang both auto-vectorize this sub-optimally.
#include <stdint.h>
#include <stddef.h>

void pack_high8_baseline(uint8_t *__restrict__ dst,
                         const uint16_t *__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do {
    *dst++ = *src++ >> 8;
  } while (dst < end_dst);
}

https://godbolt.org/g/R639cg

clang 6.0.0 (trunk 314968) -O3 -march=skylake-avx512 -mavx512vbmi:

    .LBB2_3:                 # =>This Inner Loop Header: Depth=1
        vpsrlw  $8, (%rsi,%rax,2), %zmm0
        vpsrlw  $8, 64(%rsi,%rax,2), %zmm1
        vpsrlw  $8, 128(%rsi,%rax,2), %zmm2
        vpsrlw  $8, 192(%rsi,%rax,2), %zmm3
        vpmovwb %zmm0, (%rdi,%rax)
        vpmovwb %zmm1, 32(%rdi,%rax)
        vpmovwb %zmm2, 64(%rdi,%rax)
        vpmovwb %zmm3, 96(%rdi,%rax)
        subq    $-128, %rax
        cmpq    %rax, %r9
        jne     .LBB2_3

This looks ok, but it turns out to suck because vpmovwb is 2 ALU uops for port
5.  So this produces one 256b result per 2 shuffle uops.  (Half what we can do
with vpackuswb / vpermq.)
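
For reference, clang's strategy corresponds to roughly this intrinsics
loop (a sketch, not clang's actual source: the function name and the
assumption that bytes is a multiple of 32 are mine, and clang also
unrolls it by 4):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// One 256b store per vpsrlw + vpmovwb.  Assumes bytes % 32 == 0.
void pack_high8_vpmovwb(uint8_t *__restrict__ dst,
                        const uint16_t *__restrict__ src, size_t bytes) {
  for (size_t i = 0; i < bytes; i += 32) {
    __m512i v = _mm512_srli_epi16(_mm512_loadu_si512(src + i), 8); // vpsrlw
    _mm256_storeu_si256((__m256i *)(dst + i),
                        _mm512_cvtepi16_epi8(v));                  // vpmovwb
  }
}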

Also, indexed addressing modes can't stay micro-fused with AVX instructions on
Skylake; only 2-operand instructions with a destructive destination (like
`paddw (%rsi,%rcx), %xmm0`) keep the load micro-fused.  `vpsrlw $8,
64(%rsi,%rax,2), %zmm1` will almost certainly un-laminate.
(https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes#comment76198723_31027695)

(IACA says vpsrlw will micro-fuse with a non-indexed addressing mode, even
though the instruction also takes an immediate operand, which stops
micro-fusion for some vector instructions.)  With more efficient shuffles, this
could lead to a front-end bottleneck.

--------

We can produce one 512b result per 2 shuffle uops like this, in theory getting
twice the throughput of clang's version:

.Lloop:
    vpsrlw  $8, 0(%rsi), %zmm0
    vpsrlw  $8, 64(%rsi), %zmm1
    vpackuswb %zmm1, %zmm0, %zmm0    # 1 uop for a 2-input shuffle
    vpermq  %zmm0, %zmm7, %zmm0      # lane-crossing fixup for vpackuswb
                                     # (shuffle control in zmm7)
    vmovdqu64 %zmm0, (%rdi,%rdx)

    add   $(2*64), %rsi
    add   $64, %rdx          # counts up towards zero
    jnc   .Lloop

This should be 7 fused-domain uops (the add/jnc at the bottom can macro-fuse),
and run at one store per 2 clocks: port 0 saturated with shifts, port 5
saturated with shuffles, and p23 handling 3 uops per 2 clocks (two loads plus
the store-address, since the indexed store can't use p7).  But the indexed
store saves us a CMP instruction.  With unrolling, pointer increments would be
the way to go.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 for more perf-analysis
of this.
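
In intrinsics, the vpackuswb + vpermq strategy looks something like this
(a minimal sketch: the function name, the exact control vector, and the
assumption that bytes is a multiple of 64 are mine):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// One 512b store per 2 shifts + vpackuswb + vpermq.  Assumes bytes % 64 == 0.
void pack_high8_pack_permute(uint8_t *__restrict__ dst,
                             const uint16_t *__restrict__ src, size_t bytes) {
  // vpackuswb interleaves its two inputs within 128b lanes, so a qword
  // permute puts the bytes back in linear order.
  const __m512i fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
  for (size_t i = 0; i < bytes; i += 64) {
    __m512i lo = _mm512_srli_epi16(_mm512_loadu_si512(src + i), 8);
    __m512i hi = _mm512_srli_epi16(_mm512_loadu_si512(src + i + 32), 8);
    __m512i packed = _mm512_packus_epi16(lo, hi);      // vpackuswb: in-lane
    packed = _mm512_permutexvar_epi64(fixup, packed);  // vpermq fixup
    _mm512_storeu_si512(dst + i, packed);
  }
}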

In general, AVX512 lane-crossing shuffles with an element size smaller than
32-bit are multiple uops for port5.  See
https://github.com/InstLatx64/InstLatx64/blob/master/AVX512_SKX_PortAssign_v102_PUB.ods
for SKX tput/latency from AIDA64 and from Intel, and also uop->port assignments
from IACA (Intel's static-analysis tool, not from perf counters on real
hardware).  But vpmovwb is measured at one per 2 clock throughput on real
hardware.

IACA is wrong about store micro-fusion: indexed stores can stay micro-fused (no
un-lamination) on HSW and SKX if they fuse in the first place.  (vextractf128
doesn't, but regular vmov stores do.  Presumably so does vmovdqu64, but
vmovdqu8 stores always need an ALU uop (according to Intel's docs) even without
masking.)

----------

Also, with AVX512VBMI, vpermt2b is probably a good idea.

      # shuffle control in zmm1
   .Loop:
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vpermt2b        64(%rsi,%rax,2), %zmm1, %zmm0
        vmovdqu64       %zmm0, (%rcx,%rax)
        addq    $64, %rax
        cmpq    %rax, %rdi   # could be avoided by indexing from the end of the array
        jne     .Loop

Assuming that Cannonlake will implement vpermt2b as p0 + 2p5 the way SKX
implements vpermt2w, this will also run at one 512b store per 2 clocks.

It's one fewer port 0 uop than the vpackuswb version, so only port 5 is
saturated.  vpermt2b probably can't micro-fuse at all, so an indexed
addressing mode is fine.
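
An intrinsics sketch of the VBMI version (the control-vector setup and the
multiple-of-64 size assumption are mine; _mm512_permutex2var_epi8 is the
2-source vpermt2b intrinsic):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// One vpermt2b grabs the high byte of all 64 words spanning two source
// vectors.  Assumes bytes % 64 == 0.
void pack_high8_vbmi(uint8_t *__restrict__ dst,
                     const uint16_t *__restrict__ src, size_t bytes) {
  // Byte indices 1, 3, 5, ..., 127: the high (odd) byte of each word
  // across the 128-byte concatenation of the two sources.
  uint8_t idx[64];
  for (int j = 0; j < 64; j++)
    idx[j] = (uint8_t)(2 * j + 1);
  const __m512i ctrl = _mm512_loadu_si512(idx);
  for (size_t i = 0; i < bytes; i += 64) {
    __m512i a = _mm512_loadu_si512(src + i);       // words 0..31
    __m512i b = _mm512_loadu_si512(src + i + 32);  // words 32..63
    _mm512_storeu_si512(dst + i, _mm512_permutex2var_epi8(a, ctrl, b));
  }
}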
