[PATCH] D50074: [X86][AVX2] Prefer VPBLENDW +VPBLENDD to VPBLENDVB for v16i16 blend shuffles

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 28 22:56:38 PDT 2018


pcordes added a comment.

This is looking pretty good as far as the optimization you're aiming for.  Most of my comments are on semi-related things that happen to show up in the diffs, e.g. we now end up picking `vpermw` instead of blending.



================
Comment at: test/CodeGen/X86/insertelement-ones.ll:312
+; AVX512-NEXT:    retq
   %1 = insertelement <16 x i16> %a, i16 -1, i32 0
   %2 = insertelement <16 x i16> %1, i16 -1, i32 6
----------------
We should have another testcase that blends something other than -1, because the best way to do this blend with three `-1` elements is with an OR.  OR with -1 produces -1 regardless of the previous contents, and OR with 0 is a no-op, so ORing with a constant of -1s and 0s is exactly this blend.

I don't have numbers on loading a constant vs. a couple of extra uops outside a loop.  Obviously any time we have a loop that either keeps the constant hot in cache or lets us hoist it into a register, this is a very nice win.

https://godbolt.org/z/JNv5VZ shows that this works: a manually optimized version of the function gives the same result for constant-propagation.

        vpor   .LCPI1_0(%rip), %ymm0, %ymm0
        ret
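
A minimal intrinsics sketch of what I fed to godbolt (the function name and the exact constant are illustrative; the real test sets a third `-1` element that isn't visible in this hunk):

        #include <immintrin.h>

        // Blend -1 into elements 0 and 6 of a v16i16 by ORing with a
        // constant: -1 in the blended lanes, 0 (a no-op) everywhere else.
        __m256i blend_ones(__m256i a) {
            const __m256i ones = _mm256_setr_epi16(-1, 0, 0, 0, 0, 0, -1, 0,
                                                    0, 0, 0, 0, 0, 0, 0, 0);
            return _mm256_or_si256(a, ones);   // vpor with a memory source
        }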

clang actually used `vorps`, but that can only run on port 5 before Skylake.  I compiled with `-march=haswell`, so compiling `_mm256_or_si256` to `vorps` (port 5 only) instead of `vpor` (port 0/1/5) is really silly for an integer vector.  (SKL lets `vorps` run on any port, with latency between FP instructions depending on which port it actually picks, but it's still a poor choice for HSW.)

Without AVX, `por` is 1 byte longer than `orps`, but even then `por` can be worth it on pre-Skylake, depending on the surrounding code (port 5 pressure, and/or whether there's any ILP for this blend).  Also, with Hyperthreading, uops that can run on any port are more likely to take full advantage of the extra ILP exposed by SMT, vs. potentially having both threads bottleneck on the same port together.



================
Comment at: test/CodeGen/X86/vector-shuffle-256-v32.ll:436
+; AVX512VLBW-NEXT:    vpermw %ymm0, %ymm1, %ymm0
+; AVX512VLBW-NEXT:    vpshufb {{.*#+}} ymm0 = ymm0[0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
 ; AVX512VLBW-NEXT:    retq
----------------
`vpermw` costs 2 shuffle uops and 4c latency on SKX, so it's implemented internally as a lane-crossing + in-lane shuffle.  Some future CPU might make it single-uop, though.

If we need a `vpshufb` anyway, can we use a wider-granularity shuffle like `vpermd` (single uop), using a vector constant there?  I guess immediate `vpermq` isn't super helpful.
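
A minimal sketch of the `vpermd` + `vpshufb` pattern I mean (illustrative only, not the exact shuffle in this test; this one broadcasts byte 19):

        // vpermd (1 uop, lane-crossing) broadcasts dword 4 to every dword
        // slot; vpshufb then grabs byte 3 within each 16-byte lane, which
        // is byte 3 of dword 4 = byte 19 of the original vector.
        __m256i bcast_byte19(__m256i v) {
            __m256i t = _mm256_permutevar8x32_epi32(v, _mm256_set1_epi32(4));
            return _mm256_shuffle_epi8(t, _mm256_set1_epi8(3));
        }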


================
Comment at: test/CodeGen/X86/vector-shuffle-256-v32.ll:437
-; AVX512VLBW-NEXT:    vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]
-; AVX512VLBW-NEXT:    movw $1, %ax
-; AVX512VLBW-NEXT:    kmovd %eax, %k1
----------------
`movw $1, %ax` has a false dependency on the old value of RAX.  Saving 1 byte of code size is probably not worth it, unless benchmarks show that the code density (helping the uop cache / front-end) matters more than back-end false deps.  But the false-dep effect can randomly range from really hurting (if the last write of RAX was a cache miss or the end of a long dep chain) to negligible, so it's hard to benchmark.

(But at least this doesn't cause a decode stall: even though the operand-size prefix changes the length of the *rest* of the instruction because of the imm16, `mov` is exempt from LCP stalls on Sandybridge-family.)
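
For comparison, the dependency-breaking version costs exactly 1 byte (standard encodings):

        movw    $1, %ax      # 4 bytes: merges into RAX -> false dep
        movl    $1, %eax     # 5 bytes: zero-extends into RAX, breaks the dep
        kmovd   %eax, %k1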

I've noticed clang using narrow operand sizes (like 8-bit) in other cases where it doesn't gain anything and potentially hurts.

(TODO: report this as a separate bug; it's obviously unrelated to this.)


================
Comment at: test/CodeGen/X86/vector-shuffle-256-v32.ll:440
-; AVX512VLBW-NEXT:    vmovdqu16 %ymm0, %ymm1 {%k1}
-; AVX512VLBW-NEXT:    vpshufb {{.*#+}} ymm0 = ymm1[0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
 ; AVX512VLBW-NEXT:    retq
----------------
Could we have used merge-masking on the `vpshufb  control, %ymm0, %ymm1 {%k1}` with a byte-granularity mask of `eax=3` (covering 2 bytes, i.e. 1 word) instead of the word-granularity mask `eax=1`?  (I'm falling asleep here, so I can't tell right now.)
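
Something like this is what I'm picturing (untested; the register assignments are illustrative):

        movl    $3, %eax                   # 2 byte-mask bits = the low word
        kmovd   %eax, %k1
        vpshufb %ymm2, %ymm0, %ymm1 {%k1}  # ymm2 = control; merge result into ymm1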


Repository:
  rL LLVM

https://reviews.llvm.org/D50074




