[PATCH] [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.

Thu May 28 02:47:57 PDT 2015

Update this with an even better algorithm that Fiora came up with when we were
discussing this in IRC.

By using PUNPCKLDQ and PUNPCKHDQ to interleave the i32 elements with zeros so
that we can use PSADBW to sum 8 bytes worth of bytes horizontally, we end up
with the results of the PSADBW being laid out perfectly to concatenate and
shrink in a single instruction with PACKUSWB. These all pipeline nicely with
the PSADBW instructions resulting in even lower latency and better throughput
than before.

We're down to an insane 10.45 cycle block throughput for this code sequence
compared to 13 for scalarized popcnt on Ivybridge. (12 vs. 13 on Haswell)

http://reviews.llvm.org/D10084

Files:
  lib/Target/X86/X86ISelLowering.cpp
  lib/Target/X86/X86ISelLowering.h
  lib/Target/X86/X86InstrFragmentsSIMD.td
  lib/Target/X86/X86InstrSSE.td
  test/CodeGen/X86/avx-popcnt.ll
  test/CodeGen/X86/avx2-popcnt.ll

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D10084.26668.patch
Type: text/x-patch
Size: 154924 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150528/9065ea2d/attachment.bin>