[PATCH] D58361: [x86] allow more 128-bit extract+shufps formation to avoid 256-bit shuffles
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Feb 19 00:17:17 PST 2019
pcordes added a comment.
Unless we're running out of registers, a loop that can hoist the load of the shuffle-control vector would certainly benefit from vpermd / vpermps in some of these cases.
On SnB-family, `vextracti/f128` and `vinserti/f128` cost the same as YMM `vperm2f128` or `vpermq` (immediate lane-crossing shuffles), and the same as `vpermd` if we have the control vector already loaded. This is *very* different from AMD bdver* / znver1 / jaguar, where insert/extract are *very* cheap, and we should avoid vperm2f128 whenever possible. (In fact emulating it with insert + extract is probably a win for tune=jaguar / bdver* / znver1. Probably not for znver2, which is expected to have 256-bit ALUs.)
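Rough intrinsics sketch of the hoisted-control-vector case (the `[0,2,4,6,1,3,5,7]` control and the loop shape are placeholders, not what this patch generates):

```
#include <immintrin.h>
#include <stddef.h>

void permute_loop(int *dst, const int *src, size_t n) {
    // Control vector loaded once; stays in a register across iterations.
    const __m256i ctrl = _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        v = _mm256_permutevar8x32_epi32(v, ctrl);   // vpermd: one lane-crossing shuffle uop
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
}
```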
It looks like some of these shuffles would be better done with a `vshufps YMM` to combine the data we want into one register, then a `vpermq YMM, imm` to reorder 64-bit chunks. On Haswell/Skylake, that gives us 1c + 3c latency, and only 2 shuffle uops. Much better than 2x vpermps + vinsertf128. (And why are we using FP shuffles on integer vectors? It probably won't cause bypass delays, but FP blends can, and we're using `vblendps` instead of `vpblendd`.)
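Intrinsics sketch of the vshufps + vpermq idea, written here for a plain v4i64 -> v4i32 style truncation of two YMM inputs (the element selection below takes the low dword of each qword; the actual tests may pick different elements):

```
#include <immintrin.h>

// Pack the low 32 bits of each 64-bit element of a and b into one YMM.
static inline __m256i trunc_2x_i64_to_i32(__m256i a, __m256i b) {
    // vshufps: grab the even dwords of each input, within each 128-bit lane.
    __m256 shuf = _mm256_shuffle_ps(_mm256_castsi256_ps(a),
                                    _mm256_castsi256_ps(b),
                                    _MM_SHUFFLE(2, 0, 2, 0));
    // vpermq: reorder the 64-bit chunks to undo the in-lane interleave.
    return _mm256_permute4x64_epi64(_mm256_castps_si256(shuf),
                                    _MM_SHUFFLE(3, 1, 2, 0));
}
```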
If we're worried about the size of vector constants, something like `[0,2,4,6,4,6,6,7]` can easily be compressed to 8-bit elements and loaded with `vpmovzxbd` (or `vpmovsxbd` for negative elements). That's fine outside a loop. We wouldn't want an extra load+shuffle inside a loop, though: `vpmovzx` can't micro-fuse with a YMM destination on SnB-family.
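E.g. something like this for the compressed constant (byte values are just the example above; `ctrl_bytes` is a made-up name):

```
#include <immintrin.h>
#include <stdint.h>

// 8-byte constant instead of 32 bytes; widened at load time.
static const uint8_t ctrl_bytes[8] = {0, 2, 4, 6, 4, 6, 6, 7};

static inline __m256i load_ctrl(void) {
    __m128i b = _mm_loadl_epi64((const __m128i *)ctrl_bytes);  // 8-byte load
    return _mm256_cvtepu8_epi32(b);                            // vpmovzxbd xmm -> ymm
}
```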
----
In some cases I think we can use `vpackusdw` (and then a lane-crossing fixup with `vpermq` immediate), but we'd have to prepare both inputs with VPAND to zero the upper 16 bits, so the values are already in u16 range and the i32 -> u16 saturation never clamps. Or `vpackuswb` for i16 -> u8. This could let us do the equivalent of `vshufps` for narrower elements. We can create the AND mask on the fly from `vpcmpeqd same,same` / `vpsrld ymm, 16`. (For AVX512 we don't need any of this, because AVX512 has narrowing with truncation as an option instead of saturation.)
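Sketch of the vpackusdw version for i32 -> u16 (hypothetical helper, with the mask generated on the fly):

```
#include <immintrin.h>

static inline __m256i pack_i32_to_u16(__m256i a, __m256i b) {
    __m256i ones = _mm256_cmpeq_epi32(a, a);         // vpcmpeqd same,same: all-ones
    __m256i mask = _mm256_srli_epi32(ones, 16);      // vpsrld 16: 0x0000FFFF per dword
    a = _mm256_and_si256(a, mask);                   // zero the high 16 bits so the
    b = _mm256_and_si256(b, mask);                   //   u16 saturation never clamps
    __m256i packed = _mm256_packus_epi32(a, b);      // vpackusdw: in-lane pack
    return _mm256_permute4x64_epi64(packed,          // vpermq: cross-lane fixup
                                    _MM_SHUFFLE(3, 1, 2, 0));
}
```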
If we already need to shift, we can use `vpsrad` to create sign-extended inputs for `vpackssdw` i32 -> i16 saturation.
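Same idea when the shift is already needed, e.g. keeping the high 16 bits of each i32 (again just a sketch):

```
#include <immintrin.h>

static inline __m256i high16_to_i16(__m256i a, __m256i b) {
    a = _mm256_srai_epi32(a, 16);                    // vpsrad: results are sign-extended,
    b = _mm256_srai_epi32(b, 16);                    //   so the i16 saturation is exact
    __m256i packed = _mm256_packs_epi32(a, b);       // vpackssdw: in-lane pack
    return _mm256_permute4x64_epi64(packed,          // vpermq: cross-lane fixup
                                    _MM_SHUFFLE(3, 1, 2, 0));
}
```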
================
Comment at: llvm/test/CodeGen/X86/vector-trunc-widen.ll:63
+; AVX2-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT: retq
;
----------------
This is really bad on Haswell/Skylake, where each of these shuffles costs a port 5 uop. Instead of 2x 128-bit vshufps, we should be doing 1x 256-bit vshufps and then doing a cross-lane fixup with `vpermq` if we have AVX2 available.
It's the same shuffle for both lanes, so it's crazy to extract and do it separately unless we're lacking AVX2.
```
vshufps $0xdd, %ymm2, %ymm1, %ymm0  # ymm0 = ymm1[1,3],ymm2[1,3],ymm1[5,7],ymm2[5,7]
vpermq  $0xd8, %ymm0, %ymm0         # ymm0 = ymm0[0,2,1,3]
retq
```
This should be optimal even on bdver* / jaguar / znver1 where vpermq / vpermpd is 3 uops / 2 cycles (Zen numbers).
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D58361/new/
https://reviews.llvm.org/D58361