[PATCH] D40215: [X86][AVX512] Use PACKSS/PACKUS for vXi16->vXi8 truncations without BWI.
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Dec 30 21:26:28 PST 2017
pcordes added a comment.
KNL has slow `vpacksswb/vpackuswb ymm`, so this doesn't look optimal for most of these cases on KNL: vpmovzxwd / vpmovdb looks better there. For AVX512F without BW+DQ in general, though, we should probably go ahead and use PACKSS/PACKUS, i.e. assume that AVX2 instructions are fast.
Also note that KNL has faster vpmovzx than vpmovsx.
================
Comment at: test/CodeGen/X86/avx512-ext.ll:1725
+; KNL-NEXT: vpcmpeqw %ymm2, %ymm0, %ymm0
+; KNL-NEXT: vpacksswb %ymm1, %ymm0, %ymm0
+; KNL-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
----------------
AVX2 ymm byte / word shuffles are SLOW on KNL, even though the xmm version is fast (except for `pshufb`). This sequence would make sense for AVX512F `tune=generic` (because it's very good on SKX and presumably future mainstream CPUs with good AVX2 support), but definitely *not* for `-march=knl`.
`vpacksswb xmm` is fast: 1 uop / 1c throughput / 2-6c latency, but the YMM versions of vpack / vpunpck (except for DQ and QDQ) are 5 uops / 9c throughput.
In this case: vpacksswb ymm + vpermq ymm = 5 + 1 = 6 shuffle uops, and maybe 10c throughput (9 + 1 assuming they all compete for the same execution resources and can't pipeline with each other).
2x vpmovzxwd y->z + 2x vpmovdb z->x + vinserti128 x->y = 5 shuffle uops, throughput = 2x2 + 2x1 + 1 = 7 cycles (with no decode stalls from multi-uop instructions). The extra ILP probably doesn't help at all because there appears to be only one shuffle execution unit (on FP0). So it's not *much* better, but avoiding the decode bottleneck should allow much better out-of-order execution and probably better hyperthreading friendliness.
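For reference, a sketch of that zx-based sequence (register assignments are just illustrative, assuming the two 16 x i16 inputs are in ymm0 / ymm1):

  vpmovzxwd   %ymm0, %zmm0                 # 1 uop, 2c throughput on KNL
  vpmovzxwd   %ymm1, %zmm1                 # 1 uop, 2c
  vpmovdb     %zmm0, %xmm0                 # 1 uop, 1c
  vpmovdb     %zmm1, %xmm1                 # 1 uop, 1c
  vinserti128 $1, %xmm1, %ymm0, %ymm0      # 1 uop, 1c  => 5 uops, ~7c total

Unlike the pack + vpermq sequence, this keeps the elements in order, so no lane-crossing fixup is needed.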
================
Comment at: test/CodeGen/X86/avx512-ext.ll:1728
-; KNL-NEXT: vpmovsxwd %ymm1, %zmm1
-; KNL-NEXT: vpmovdb %zmm1, %xmm1
-; KNL-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
----------------
`vpmovsx` (all forms) is 2 uops on KNL, vs. 1 for `vpmovzx` (all element / register sizes). This is a big deal for the front-end (2c throughput vs. 7-8c throughput). If you're about to feed it to a truncate and only doing it to work around lack of AVX512BW, definitely use ZX.
If only one vector were needed, vpmovzx %ymm, %zmm / vpmovdb %zmm, %xmm looks like a big win according to Agner Fog's uarch guide + instruction tables.
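i.e. something like (registers illustrative):

  vpmovzxwd %ymm0, %zmm0    # 1 uop / 2c throughput on KNL (vs. 2 uops / 7-8c for vpmovsxwd)
  vpmovdb   %zmm0, %xmm0    # 1 uop / 1c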
================
Comment at: test/CodeGen/X86/avx512-trunc.ll:568
+; KNL-NEXT: vextracti128 $1, %ymm0, %xmm1
+; KNL-NEXT: vpackuswb %xmm1, %xmm0, %xmm0
; KNL-NEXT: vzeroupper
----------------
This is a win: two 1-uop shuffles with 1c throughput (vextracti128 / vpackuswb xmm) is definitely better than `vpmovzx` (1 uop / 2c throughput) / `vpmovdb` (1 uop / 1c throughput).
And vpmovSX is 2 uops, 7-8c throughput (decode bottleneck), so the original was horrible because of the missed vpmovzx optimization; the vpackuswb version is better even than a vpmovzx version because it only uses XMM registers.
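Side by side, with KNL numbers from Agner Fog's tables (registers illustrative, and ignoring the saturation-vs-truncation details the compiler has to get right either way):

  # pack version: 2 uops, ~2c
  vextracti128 $1, %ymm0, %xmm1
  vpackuswb    %xmm1, %xmm0, %xmm0

  # zx version: 2 uops, ~3c
  vpmovzxwd    %ymm0, %zmm0
  vpmovdb      %zmm0, %xmm0

  # sx version (the original): 3 uops, 7-8c (decode bottleneck)
  vpmovsxwd    %ymm0, %zmm0
  vpmovdb      %zmm0, %xmm0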
================
Comment at: test/CodeGen/X86/vector-compare-results.ll:320
; AVX512F-NEXT: vpcmpgtw %ymm1, %ymm0, %ymm0
-; AVX512F-NEXT: vpmovsxwd %ymm0, %zmm0
-; AVX512F-NEXT: vpmovdb %zmm0, %xmm0
+; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512F-NEXT: vpacksswb %xmm1, %xmm0, %xmm0
----------------
delena wrote:
> You are inserting AVX2 instructions instead of AVX-512, right? If yes, the prev code is better, since we have more registers in AVX-512.
In most cases the extra register pressure is hopefully minor compared to the shuffle-throughput gain from using vextracti128 / vpacksswb xmm (both 1c throughput on KNL).
vpmovsx is 2 uops on KNL, so using it instead of vpmovzx is a big missed optimization, but even vpmovzx is 1 uop / 2c throughput (not fully pipelined). See my previous comment.
Repository:
rL LLVM
https://reviews.llvm.org/D40215