[PATCH] D54668: [X86] Attempt to improve v32i8/v64i8 multiply lowering by applying the v16i8 non-avx2 algorithm to each 128-bit lane.

Sat Nov 17 11:34:49 PST 2018

craig.topper created this revision.
craig.topper added reviewers: RKSimon, spatel.

Previously we split the vectors in half to allow the two halves to be any-extended, then concatenated the results back together.

This patch instead extends the v16i8 SSE algorithm: it splits each 128-bit lane in half and extends each half using punpcklbw/punpckhbw, multiplies all the low half-lanes together and all the high half-lanes together in separate operations, and then merges the half-lane results back together using packuswb.
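
For readers who want to see the data flow, here is a rough scalar model of the per-lane scheme for v32i8. It is only an illustration and is not the code in X86ISelLowering.cpp: the names V32i8, mulPerLane, mulReference, widenAny, pmullwElement and packusElement are made up for this sketch. It also assumes the 16-bit products are masked to their low byte before packing. That step is not spelled out above, but packuswb saturates, so something equivalent is needed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a 256-bit vector of 32 bytes.
using V32i8 = std::array<std::uint8_t, 32>;

constexpr std::size_t kLaneBytes = 16; // one 128-bit lane
constexpr std::size_t kHalfBytes = 8;  // half of a lane

// Models one element of punpcklbw/punpckhbw with the same register as both
// operands: the byte lands in the low 8 bits of a 16-bit element and the
// high 8 bits are a don't-care (any-extend). Here they hold a copy of the byte.
inline std::uint16_t widenAny(std::uint8_t x) {
  return static_cast<std::uint16_t>(x | (x << 8));
}

// Models one element of pmullw: the low 16 bits of the product.
inline std::uint16_t pmullwElement(std::uint16_t x, std::uint16_t y) {
  return static_cast<std::uint16_t>(static_cast<std::uint32_t>(x) * y);
}

// Models one element of packuswb: the word is read as signed and
// saturated to the range [0, 255].
inline std::uint8_t packusElement(std::uint16_t w) {
  const std::int16_t s = static_cast<std::int16_t>(w);
  if (s < 0) return 0;
  if (s > 255) return 255;
  return static_cast<std::uint8_t>(s);
}

// Per-lane multiply: within each 128-bit lane, widen the low and high
// 8 bytes separately, multiply as 16-bit elements, mask to the low byte,
// and pack both halves back into the same lane.
V32i8 mulPerLane(const V32i8 &a, const V32i8 &b) {
  V32i8 result{};
  for (std::size_t lane = 0; lane < a.size(); lane += kLaneBytes) {
    for (std::size_t i = 0; i < kHalfBytes; ++i) {
      const std::size_t loIdx = lane + i;
      const std::size_t hiIdx = lane + kHalfBytes + i;
      // Mask so that packuswb sees values in [0, 255] and does not saturate.
      const std::uint16_t lo =
          pmullwElement(widenAny(a[loIdx]), widenAny(b[loIdx])) & 0x00FF;
      const std::uint16_t hi =
          pmullwElement(widenAny(a[hiIdx]), widenAny(b[hiIdx])) & 0x00FF;
      result[loIdx] = packusElement(lo);
      result[hiIdx] = packusElement(hi);
    }
  }
  return result;
}

// Plain element-wise i8 multiply (wrapping), for comparison.
V32i8 mulReference(const V32i8 &a, const V32i8 &b) {
  V32i8 result{};
  for (std::size_t i = 0; i < a.size(); ++i)
    result[i] = static_cast<std::uint8_t>(a[i] * b[i]);
  return result;
}
```

Because the unpack and pack instructions both work within a 128-bit lane, the low and high half-lane results end up back in the lane they came from, so no cross-lane shuffle should be needed to restore element order.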

Unfortunately, some of the cases in vector-reduce-mul.ll regress because we aren't narrowing the vector width of the multiplies as we reduce. The splitting was somewhat making up for that before by causing halves to be discarded after the split.

Repository:
  rL LLVM

https://reviews.llvm.org/D54668

Files:
  lib/Target/X86/X86ISelLowering.cpp
  test/CodeGen/X86/avx2-arith.ll
  test/CodeGen/X86/min-legal-vector-width.ll
  test/CodeGen/X86/pmul.ll
  test/CodeGen/X86/prefer-avx256-wide-mul.ll
  test/CodeGen/X86/vector-mul.ll
  test/CodeGen/X86/vector-reduce-mul.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D54668.174511.patch
Type: text/x-patch
Size: 205446 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20181117/09add742/attachment-0001.bin>