[PATCH] D58361: [x86] allow more 128-bit extract+shufps formation to avoid 256-bit shuffles
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Feb 19 00:17:17 PST 2019
pcordes added a comment.
Unless we're running out of registers, a loop that can hoist the load of the shuffle-control vector would certainly benefit from vpermd / vpermps in some of these cases.
On SnB-family, `vextracti/f128` and `vinserti/f128` cost the same as YMM `vperm2f128` or `vpermq` (immediate lane-crossing shuffles), and the same as `vpermd` if we have the control vector already loaded. This is *very* different from AMD bdver* / znver1 / jaguar, where insert/extract are *very* cheap, and we should avoid vperm2f128 whenever possible. (In fact emulating it with insert + extract is probably a win for tune=jaguar / bdver* / znver1. Probably not for znver2, which is expected to have 256-bit ALUs.)
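Rough intrinsics sketch of the hoisted-control-vector case (the `[0,2,4,6,1,3,5,7]` control and the loop shape are placeholders, not what this patch generates):

```
#include <immintrin.h>
#include <stddef.h>

void permute_loop(int *dst, const int *src, size_t n) {
    // Control vector loaded once; stays in a register across iterations.
    const __m256i ctrl = _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        v = _mm256_permutevar8x32_epi32(v, ctrl);   // vpermd: one lane-crossing shuffle uop
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
}
```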
It looks like some of these shuffles would be better done with a `vshufps YMM` to combine the data we want into one register, then a `vpermq YMM, imm` to reorder 64-bit chunks. On Haswell/Skylake, that gives us 1c + 3c latency, and only 2 shuffle uops. Much better than 2x vpermps + vinsertf128. (And why are we using FP shuffles on integer vectors? It probably won't cause bypass delays, but FP blends can, and we're using `vblendps` instead of `vpblendd`.)
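Intrinsics sketch of the vshufps + vpermq idea, written here for a plain v4i64 -> v4i32 style truncation of two YMM inputs (the element selection below takes the low dword of each qword; the actual tests may pick different elements):

```
#include <immintrin.h>

// Pack the low 32 bits of each 64-bit element of a and b into one YMM.
static inline __m256i trunc_2x_i64_to_i32(__m256i a, __m256i b) {
    // vshufps: grab the even dwords of each input, within each 128-bit lane.
    __m256 shuf = _mm256_shuffle_ps(_mm256_castsi256_ps(a),
                                    _mm256_castsi256_ps(b),
                                    _MM_SHUFFLE(2, 0, 2, 0));
    // vpermq: reorder the 64-bit chunks to undo the in-lane interleave.
    return _mm256_permute4x64_epi64(_mm256_castps_si256(shuf),
                                    _MM_SHUFFLE(3, 1, 2, 0));
}
```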
If we're worried about the size of vector constants, something like `[0,2,4,6,4,6,6,7]` can easily be compressed to 8-bit elements and loaded with `vpmovzxbd` (or `vpmovsxbd` for negative elements). That's fine outside a loop. We wouldn't want an extra load+shuffle inside a loop, though: `vpmovzx` can't micro-fuse with a YMM destination on SnB-family.
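E.g. something like this for the compressed constant (byte values are just the example above; `ctrl_bytes` is a made-up name):

```
#include <immintrin.h>
#include <stdint.h>

// 8-byte constant instead of 32 bytes; widened at load time.
static const uint8_t ctrl_bytes[8] = {0, 2, 4, 6, 4, 6, 6, 7};

static inline __m256i load_ctrl(void) {
    __m128i b = _mm_loadl_epi64((const __m128i *)ctrl_bytes);  // 8-byte load
    return _mm256_cvtepu8_epi32(b);                            // vpmovzxbd xmm -> ymm
}
```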
----
In some cases I think we can use `vpackusdw` (and then a lane-crossing fixup with `vpermq` immediate), but we'd have to prepare both inputs with VPAND to zero the upper 16 bits, so the values are already in u16 range and the i32 -> u16 saturation never clamps. Or `vpackuswb` for i16 -> u8. This could let us do the equivalent of `vshufps` for narrower elements. We can create the AND mask on the fly from `vpcmpeqd same,same` / `vpsrld ymm, 16`. (For AVX512 we don't need any of this, because AVX512 has narrowing with truncation as an option instead of saturation.)
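Sketch of the vpackusdw version for i32 -> u16 (hypothetical helper, with the mask generated on the fly):

```
#include <immintrin.h>

static inline __m256i pack_i32_to_u16(__m256i a, __m256i b) {
    __m256i ones = _mm256_cmpeq_epi32(a, a);         // vpcmpeqd same,same: all-ones
    __m256i mask = _mm256_srli_epi32(ones, 16);      // vpsrld 16: 0x0000FFFF per dword
    a = _mm256_and_si256(a, mask);                   // zero the high 16 bits so the
    b = _mm256_and_si256(b, mask);                   //   u16 saturation never clamps
    __m256i packed = _mm256_packus_epi32(a, b);      // vpackusdw: in-lane pack
    return _mm256_permute4x64_epi64(packed,          // vpermq: cross-lane fixup
                                    _MM_SHUFFLE(3, 1, 2, 0));
}
```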
If we already need to shift, we can use `vpsrad` to create sign-extended inputs for `vpackssdw` i32 -> i16 saturation.
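Same idea when the shift is already needed, e.g. keeping the high 16 bits of each i32 (again just a sketch):

```
#include <immintrin.h>

static inline __m256i high16_to_i16(__m256i a, __m256i b) {
    a = _mm256_srai_epi32(a, 16);                    // vpsrad: results are sign-extended,
    b = _mm256_srai_epi32(b, 16);                    //   so the i16 saturation is exact
    __m256i packed = _mm256_packs_epi32(a, b);       // vpackssdw: in-lane pack
    return _mm256_permute4x64_epi64(packed,          // vpermq: cross-lane fixup
                                    _MM_SHUFFLE(3, 1, 2, 0));
}
```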
================
Comment at: llvm/test/CodeGen/X86/vector-trunc-widen.ll:63
+; AVX2-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT: retq
;
----------------
This is really bad on Haswell/Skylake, where each of these shuffles costs a port 5 uop. Instead of 2x 128-bit vshufps, we should be doing 1x 256-bit vshufps and then doing a cross-lane fixup with `vpermq` if we have AVX2 available.
It's the same shuffle for both lanes, so it's crazy to extract and do it separately unless we're lacking AVX2.
```
vshufps $0xdd, %ymm2, %ymm1, %ymm0  # ymm0 = ymm1[1,3],ymm2[1,3],ymm1[5,7],ymm2[5,7]
vpermq  $0xd8, %ymm0, %ymm0         # ymm0 = ymm0[0,2,1,3]
retq
```
This should be optimal even on bdver* / jaguar / znver1 where vpermq / vpermpd is 3 uops / 2 cycles (Zen numbers).
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D58361/new/
https://reviews.llvm.org/D58361