[PATCH] [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.
Owen Anderson
resistor at mac.com
Thu May 28 09:53:28 PDT 2015
I believe the same approach would work on ARM64, which also has byte-wise vector popcounts and can do interleave-with-zero. Do you think it would be worthwhile to find a way to share the core of this approach?
—Owen
> On May 28, 2015, at 2:47 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> Update this with an even better algorithm that Fiora came up with when we were
> discussing this in IRC.
>
> By using PUNPCKLDQ and PUNPCKHDQ to interleave the i32 elements with zeros, we
> can use PSADBW to horizontally sum the per-byte counts 8 bytes at a time. The
> results of the PSADBW are then laid out perfectly to concatenate and shrink in
> a single instruction with PACKUSWB. These all pipeline nicely with the PSADBW
> instructions, resulting in even lower latency and better throughput than
> before.
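
The sequence described above can be sketched as a small Python emulation of the instruction semantics on a 128-bit vector of four i32 elements. This is an illustration only, not the code from the patch: the helper names (pshufb, punpckldq, punpckhdq, psadbw, packuswb, popcount_v4i32) are hypothetical stand-ins that model each instruction on a list of 16 bytes, and the v4i32 shape is an assumption.

```python
import struct

# Popcount of every 4-bit value; this is the in-register lookup table.
NIBBLE_POPCNT = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4]

def pshufb(table, idx):
    # Byte shuffle: the low 4 bits of each index select a table byte;
    # an index with its high bit set produces zero.
    return [0 if i & 0x80 else table[i & 0x0F] for i in idx]

def punpckldq(a, b):
    # Interleave the two low dwords of a and b: a0 b0 a1 b1.
    return a[0:4] + b[0:4] + a[4:8] + b[4:8]

def punpckhdq(a, b):
    # Interleave the two high dwords of a and b: a2 b2 a3 b3.
    return a[8:12] + b[8:12] + a[12:16] + b[12:16]

def psadbw(a, b):
    # Sum of absolute byte differences per 8-byte half, stored as a
    # little-endian 64-bit value in that half.
    out = []
    for base in (0, 8):
        total = sum(abs(x - y) for x, y in zip(a[base:base + 8], b[base:base + 8]))
        out += list(total.to_bytes(8, "little"))
    return out

def packuswb(a, b):
    # Narrow the signed 16-bit words of a, then b, to bytes with
    # unsigned saturation.
    def words(v):
        return [int.from_bytes(bytes(v[i:i + 2]), "little", signed=True)
                for i in range(0, 16, 2)]
    return [min(max(w, 0), 255) for w in words(a) + words(b)]

def popcount_v4i32(values):
    v = list(struct.pack("<4I", *values))
    # Step 1: per-byte popcount from two nibble lookups.
    lo = pshufb(NIBBLE_POPCNT, [b & 0x0F for b in v])
    hi = pshufb(NIBBLE_POPCNT, [b >> 4 for b in v])
    byte_counts = [x + y for x, y in zip(lo, hi)]
    # Step 2: interleave each i32 with zero so each 8-byte half holds the
    # byte counts of exactly one element, then sum each half.
    zero = [0] * 16
    sums_lo = psadbw(punpckldq(byte_counts, zero), zero)
    sums_hi = psadbw(punpckhdq(byte_counts, zero), zero)
    # Step 3: one pack concatenates and shrinks both results into four i32s.
    return list(struct.unpack("<4I", bytes(packuswb(sums_lo, sums_hi))))
```

Each PSADBW result holds a count in the low word of a 64-bit lane with zeros above it, so narrowing the eight words of each input to bytes leaves one count in the low byte of each 32-bit element, which is why a single PACKUSWB suffices.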
>
> We're down to an insane 10.45 cycle block throughput for this code sequence
> compared to 13 for scalarized popcnt on Ivybridge. (12 vs. 13 on Haswell)
>
>
> http://reviews.llvm.org/D10084
>
> Files:
> lib/Target/X86/X86ISelLowering.cpp
> lib/Target/X86/X86ISelLowering.h
> lib/Target/X86/X86InstrFragmentsSIMD.td
> lib/Target/X86/X86InstrSSE.td
> test/CodeGen/X86/avx-popcnt.ll
> test/CodeGen/X86/avx2-popcnt.ll
>
> <D10084.26668.patch>