[llvm] [X86] Fold VPERMV3(X, M, Y) -> VPERMV(CONCAT(X, Y), WIDEN(M)) iff the CONCAT is free (PR #122485)
Phoebe Wang via llvm-commits
llvm-commits at lists.llvm.org
Sat Jan 11 16:38:05 PST 2025
================
@@ -65,10 +65,9 @@ define void @shuffle_v16i32_to_v8i32_1(ptr %L, ptr %S) nounwind {
;
; AVX512BWVL-FAST-ALL-LABEL: shuffle_v16i32_to_v8i32_1:
; AVX512BWVL-FAST-ALL: # %bb.0:
-; AVX512BWVL-FAST-ALL-NEXT: vmovdqa (%rdi), %ymm0
-; AVX512BWVL-FAST-ALL-NEXT: vpmovsxbd {{.*#+}} ymm1 = [1,3,5,7,9,11,13,15]
-; AVX512BWVL-FAST-ALL-NEXT: vpermi2d 32(%rdi), %ymm0, %ymm1
-; AVX512BWVL-FAST-ALL-NEXT: vmovdqa %ymm1, (%rsi)
+; AVX512BWVL-FAST-ALL-NEXT: vmovaps {{.*#+}} ymm0 = [1,3,5,7,9,11,13,15]
+; AVX512BWVL-FAST-ALL-NEXT: vpermps (%rdi), %zmm0, %zmm0
----------------
phoebewang wrote:
But we have the equivalent integer instruction, VPERMD, which would avoid the domain-crossing stall, wouldn't it?
The 64-bit vs. 256-bit load of the mask constant is also a concern, but my original question was only about
`vpermi2d 32(%rdi), %ymm0, %ymm1 ; 256-bit load`
vs
`vpermps (%rdi), %zmm0, %zmm0 ; 512-bit load`
Taken together, this looks like a net regression to me.
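
For reference, here is a rough scalar model of the fold in the title, applied to the test case above. It is only a sketch: the helper names (`vpermv3`, `vpermv`, `widen`, `fold`) and the sample memory contents are made up for illustration and are not LLVM APIs. The padding lanes of the widened mask are don't-care and are modeled as 0. The byte tallies at the end just add up the operand widths of the loads in the two sequences shown in the diff.

```python
def vpermv3(x, mask, y):
    """Two-source permute: indices 0..n-1 pick from x, n..2n-1 pick from y."""
    n = len(x)
    assert len(y) == n and len(mask) == n
    both = x + y
    return [both[i % (2 * n)] for i in mask]


def vpermv(src, mask):
    """Single-source permute: each index (mod len(src)) picks a lane of src."""
    assert len(mask) == len(src)
    return [src[i % len(src)] for i in mask]


def widen(mask, lanes):
    """Pad the mask to `lanes` entries; the extra lanes are don't-care (0 here)."""
    return mask + [0] * (lanes - len(mask))


def fold(x, mask, y):
    """VPERMV(CONCAT(X, Y), WIDEN(M)), keeping only the low half of the result."""
    n = len(x)
    wide = vpermv(x + y, widen(mask, 2 * n))
    return wide[:n]


# Hypothetical memory contents: 16 dwords starting at (%rdi).
mem = list(range(100, 116))
x, y = mem[:8], mem[8:]            # (%rdi) and 32(%rdi)
mask = [1, 3, 5, 7, 9, 11, 13, 15]

old_result = vpermv3(x, mask, y)   # models vpermi2d 32(%rdi), %ymm0, %ymm1
new_result = fold(x, mask, y)      # models vpermps (%rdi), %zmm0, %zmm0 (low ymm)
assert old_result == new_result == [101, 103, 105, 107, 109, 111, 113, 115]

# Bytes loaded by each sequence, from the operand widths in the diff.
old_loads = {"vmovdqa (%rdi)": 32, "vpmovsxbd mask": 8, "vpermi2d 32(%rdi)": 32}
new_loads = {"vmovaps mask": 32, "vpermps (%rdi)": 64}
old_bytes = sum(old_loads.values())   # 72
new_bytes = sum(new_loads.values())   # 96
```

The results agree because (%rdi) and 32(%rdi) are adjacent in memory, so the concatenation is a single wider load. The tally counts bytes only and says nothing about latency or throughput.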
https://github.com/llvm/llvm-project/pull/122485