[PATCH] [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.

Thu May 28 09:53:28 PDT 2015

I believe the same approach would work on ARM64, which also has byte-wise vector popcounts and can do interleave-with-zero.  Do you think it would be worthwhile to find a way to share the core of this approach?

—Owen

> On May 28, 2015, at 2:47 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
> 
> Update this with an even better algorithm that Fiora came up with when we were
> discussing this in IRC.
> 
> By using PUNPCKLDQ and PUNPCKHDQ to interleave the i32 elements with zeros so
> that we can use PSADBW to sum 8 bytes' worth of byte counts horizontally, we end up
> with the results of the PSADBW being laid out perfectly to concatenate and
> shrink in a single instruction with PACKUSWB. These all pipeline nicely with
> the PSADBW instructions resulting in even lower latency and better throughput
> than before.
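
For anyone following along, the data flow described above can be modeled in scalar code for one 128-bit vector of four i32 elements. This is only a rough sketch of the lane movement, not the lowering code from the patch: the helper names (byte_popcounts, popcount_v4i32, and the lowercase instruction emulations) are hypothetical, and each helper assumes the documented behaviour of the SSE instruction it is named after.

```python
import struct

# Popcount of every 4-bit value; stands in for the in-register PSHUFB table.
NIBBLE_LUT = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4]
ZERO = bytes(16)

def byte_popcounts(v):
    """Per-byte popcount: look up the low and high nibble, then add them."""
    return bytes(NIBBLE_LUT[b & 0xF] + NIBBLE_LUT[b >> 4] for b in v)

def punpckldq(a, b):
    """Interleave the two low 32-bit elements of a and b."""
    return a[0:4] + b[0:4] + a[4:8] + b[4:8]

def punpckhdq(a, b):
    """Interleave the two high 32-bit elements of a and b."""
    return a[8:12] + b[8:12] + a[12:16] + b[12:16]

def psadbw(a, b):
    """Sum of absolute byte differences over each 8-byte half, as two u64s."""
    out = b""
    for i in (0, 8):
        total = sum(abs(x - y) for x, y in zip(a[i:i + 8], b[i:i + 8]))
        out += struct.pack("<Q", total)
    return out

def packuswb(a, b):
    """Pack 8+8 signed 16-bit words into 16 bytes with unsigned saturation."""
    words = struct.unpack("<8h", a) + struct.unpack("<8h", b)
    return bytes(min(max(w, 0), 255) for w in words)

def popcount_v4i32(elems):
    """Model of the sequence for a vector of four 32-bit elements."""
    counts = byte_popcounts(struct.pack("<4I", *elems))
    # Each i32's four byte counts now share an 8-byte group with four zeros,
    # so PSADBW against zero yields exactly one element's total per group.
    lo = psadbw(punpckldq(counts, ZERO), ZERO)
    hi = psadbw(punpckhdq(counts, ZERO), ZERO)
    # The two u64 sums per register shrink back into i32 positions.
    return list(struct.unpack("<4I", packuswb(lo, hi)))
```

With the inputs [0, 1, 0xFFFFFFFF, 0x80000001], the model returns [0, 1, 32, 2]. Since no count exceeds 32, the saturation in the pack step never triggers.
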
> 
> We're down to an insane 10.45 cycle block throughput for this code sequence
> compared to 13 for scalarized popcnt on Ivybridge. (12 vs. 13 on Haswell)
> 
> 
> http://reviews.llvm.org/D10084
> 
> Files:
>  lib/Target/X86/X86ISelLowering.cpp
>  lib/Target/X86/X86ISelLowering.h
>  lib/Target/X86/X86InstrFragmentsSIMD.td
>  lib/Target/X86/X86InstrSSE.td
>  test/CodeGen/X86/avx-popcnt.ll
>  test/CodeGen/X86/avx2-popcnt.ll
> 