[PATCH] D40215: [X86][AVX512] Use PACKSS/PACKUS for vXi16->vXi8 truncations without BWI.
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Dec 30 21:26:28 PST 2017
pcordes added a comment.
KNL has slow `vpacksswb/vpackuswb ymm`, so this doesn't look optimal for most of these cases on KNL: vpmovzxwd / vpmovdb looks better there. For AVX512F without BW+DQ in general, though, we should probably go ahead and use PACKSS/PACKUS, i.e. assume that AVX2 instructions are fast.
Also note that KNL has faster vpmovzx than vpmovsx.
================
Comment at: test/CodeGen/X86/avx512-ext.ll:1725
+; KNL-NEXT: vpcmpeqw %ymm2, %ymm0, %ymm0
+; KNL-NEXT: vpacksswb %ymm1, %ymm0, %ymm0
+; KNL-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
----------------
AVX2 ymm byte / word shuffles are SLOW on KNL, even though the xmm version is fast (except for `pshufb`). This sequence would make sense for AVX512F `tune=generic` (because it's very good on SKX and presumably future mainstream CPUs with good AVX2 support), but definitely *not* for `-march=knl`.
`vpacksswb xmm` is fast: 1 uop / 1c throughput / 2-6c latency, but the YMM versions of vpack / vpunpck (except for DQ and QDQ) are 5 uops / 9c throughput.
In this case: vpacksswb ymm + vpermq ymm = 5 + 1 = 6 shuffle uops, and maybe 10c throughput (9 + 1 assuming they all compete for the same execution resources and can't pipeline with each other).
2x vpmovzxwd y->z + 2x vpmovdb z->x + vinserti128 x->y = 5 shuffle uops, throughput = 2x2 + 2x1 + 1 = 7 cycles (with no decode stalls from multi-uop instructions). The extra ILP probably doesn't help at all because there appears to be only one shuffle execution unit (on FP0). So it's not *much* better, but avoiding the decode bottleneck should allow much better out-of-order execution and probably better hyperthreading friendliness.
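For reference, a sketch of that zx-based sequence (register assignments are just illustrative, assuming the two 16 x i16 inputs are in ymm0 / ymm1):

  vpmovzxwd   %ymm0, %zmm0                 # 1 uop, 2c throughput on KNL
  vpmovzxwd   %ymm1, %zmm1                 # 1 uop, 2c
  vpmovdb     %zmm0, %xmm0                 # 1 uop, 1c
  vpmovdb     %zmm1, %xmm1                 # 1 uop, 1c
  vinserti128 $1, %xmm1, %ymm0, %ymm0      # 1 uop, 1c  => 5 uops, ~7c total

Unlike the pack + vpermq sequence, this keeps the elements in order, so no lane-crossing fixup is needed.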
================
Comment at: test/CodeGen/X86/avx512-ext.ll:1728
-; KNL-NEXT: vpmovsxwd %ymm1, %zmm1
-; KNL-NEXT: vpmovdb %zmm1, %xmm1
-; KNL-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
----------------
`vpmovsx` (all forms) is 2 uops on KNL, vs. 1 for `vpmovzx` (all element / register sizes). This is a big deal for the front-end (2c throughput vs. 7-8c throughput). If you're about to feed it to a truncate and only doing it to work around lack of AVX512BW, definitely use ZX.
If only one vector were needed, vpmovzx %ymm, %zmm / vpmovdb %zmm, %xmm looks like a big win according to Agner Fog's uarch guide + instruction tables.
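i.e. something like (registers illustrative):

  vpmovzxwd %ymm0, %zmm0    # 1 uop / 2c throughput on KNL (vs. 2 uops / 7-8c for vpmovsxwd)
  vpmovdb   %zmm0, %xmm0    # 1 uop / 1c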
================
Comment at: test/CodeGen/X86/avx512-trunc.ll:568
+; KNL-NEXT: vextracti128 $1, %ymm0, %xmm1
+; KNL-NEXT: vpackuswb %xmm1, %xmm0, %xmm0
; KNL-NEXT: vzeroupper
----------------
This is a win: two 1-uop shuffles with 1c throughput (vextracti128 / vpackuswb xmm) is definitely better than `vpmovzx` (1 uop / 2c throughput) / `vpmovdb` (1 uop / 1c throughput).
And vpmovSX is 2 uops, 7-8c throughput (decode bottleneck), so the original was horrible because of the missed vpmovzx optimization; the vpackuswb version is better even than a vpmovzx version because it only uses XMM registers.
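Side by side, with KNL numbers from Agner Fog's tables (registers illustrative, and ignoring the saturation-vs-truncation details the compiler has to get right either way):

  # pack version: 2 uops, ~2c
  vextracti128 $1, %ymm0, %xmm1
  vpackuswb    %xmm1, %xmm0, %xmm0

  # zx version: 2 uops, ~3c
  vpmovzxwd    %ymm0, %zmm0
  vpmovdb      %zmm0, %xmm0

  # sx version (the original): 3 uops, 7-8c (decode bottleneck)
  vpmovsxwd    %ymm0, %zmm0
  vpmovdb      %zmm0, %xmm0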
================
Comment at: test/CodeGen/X86/vector-compare-results.ll:320
; AVX512F-NEXT: vpcmpgtw %ymm1, %ymm0, %ymm0
-; AVX512F-NEXT: vpmovsxwd %ymm0, %zmm0
-; AVX512F-NEXT: vpmovdb %zmm0, %xmm0
+; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512F-NEXT: vpacksswb %xmm1, %xmm0, %xmm0
----------------
delena wrote:
> You are inserting AVX2 instructions instead of AVX-512, right? If yes, the prev code is better, since we have more registers in AVX-512.
In most cases the extra register pressure is hopefully minor compared to the shuffle-throughput gain from using vextracti128 / vpacksswb xmm (both 1c throughput on KNL).
vpmovsx is 2 uops on KNL, so using it instead of vpmovzx is a big missed optimization, but even vpmovzx is 1 uop / 2c throughput (not fully pipelined). See my previous comment.
Repository:
rL LLVM
https://reviews.llvm.org/D40215