[PATCH] D54668: [X86] Attempt to improve v32i8/v64i8 multiply lowering by applying the v16i8 non-avx2 algorithm to each 128-bit lane.

Sun Nov 18 05:16:43 PST 2018

RKSimon added a comment.

What does IACA/LLVM-MCA say about the regressions in min-legal-vector-width.ll and vector-reduce-mul.ll?

================
Comment at: test/CodeGen/X86/vector-mul.ll:575
 ; X64-XOP-NEXT:    vpmullw {{.*}}(%rip), %xmm0, %xmm0
-; X64-XOP-NEXT:    vpperm {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14],xmm1[0,2,4,6,8,10,12,14]
+; X64-XOP-NEXT:    vpand %xmm2, %xmm0, %xmm0
+; X64-XOP-NEXT:    vpackuswb %xmm1, %xmm0, %xmm0
----------------
craig.topper wrote:
> We lost the combine here that turned the and+packuswb into vpperm between vector op legalization and dag combine. I'm not sure why shuffle combining wasn't able to do the same with the regular shuffle.
This is rather odd - I'll take a look once this has landed.
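
For reference, a minimal scalar sketch of why the and+packuswb pair and the old vpperm produce the same bytes. It assumes little-endian lanes (so byte 2*i of a register is the low byte of word i) and that both packuswb inputs have already been masked with 0x00FF; the function names are hypothetical and only model the instruction semantics, not the actual lowering code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* PACKUSWB treats each 16-bit lane as signed and saturates it to 0..255. */
static uint8_t sat_u8_from_s16(uint16_t v) {
    if (v & 0x8000u) return 0;     /* negative as int16 -> clamps to 0 */
    if (v > 0x00FFu) return 0xFF;  /* above 255 -> clamps to 255 */
    return (uint8_t)v;
}

/* Model of: vpand mask, a, a ; vpand mask, b, b ; vpackuswb b, a, out
 * (AT&T operand order: a fills bytes 0..7, b fills bytes 8..15). */
static void and_then_packuswb(const uint16_t a[8], const uint16_t b[8],
                              uint8_t out[16]) {
    for (int i = 0; i < 8; ++i) {
        out[i]     = sat_u8_from_s16(a[i] & 0x00FFu);
        out[i + 8] = sat_u8_from_s16(b[i] & 0x00FFu);
    }
}

/* Model of: vpperm out = a[0,2,4,...,14], b[0,2,4,...,14]
 * i.e. the even-indexed bytes, which are the low bytes of each word. */
static void vpperm_even_bytes(const uint16_t a[8], const uint16_t b[8],
                              uint8_t out[16]) {
    for (int i = 0; i < 8; ++i) {
        out[i]     = (uint8_t)(a[i] & 0xFFu);
        out[i + 8] = (uint8_t)(b[i] & 0xFFu);
    }
}

/* Returns 1 when both sequences yield identical 16-byte results. */
static int and_pack_matches_even_bytes(const uint16_t a[8],
                                       const uint16_t b[8]) {
    uint8_t packed[16], permuted[16];
    and_then_packuswb(a, b, packed);
    vpperm_even_bytes(a, b, permuted);
    return memcmp(packed, permuted, sizeof packed) == 0;
}

/* Small self-check on lanes that would saturate without the mask. */
static void check_lowering_equivalence(void) {
    const uint16_t a[8] = {0x1234, 0xFFFF, 0x0080, 0x8000,
                           0x00FF, 0x0100, 0x7F01, 0xABCD};
    const uint16_t b[8] = {0x0000, 0x0001, 0xFF00, 0x00FE,
                           0x8001, 0x7FFF, 0x0101, 0xFEDC};
    assert(and_pack_matches_even_bytes(a, b));
}
```

The mask is what makes the two forms interchangeable: once the high byte of every word is cleared, the saturation in packuswb can never trigger, so the pack degenerates into a plain truncating byte select.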

Repository:
  rL LLVM

https://reviews.llvm.org/D54668