[PATCH] [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.
resistor at mac.com
Thu May 28 09:53:28 PDT 2015
I believe the same approach would work on ARM64, which also has byte-wise vector popcounts and can do interleave-with-zero. Do you think it would be worthwhile to find a way to share the core of this approach?
> On May 28, 2015, at 2:47 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
> Update this with an even better algorithm that Fiora came up with when we were
> discussing this in IRC.
> By using PUNPCKLDQ and PUNPCKHDQ to interleave the i32 elements with zeros so
> that we can use PSADBW to sum 8 bytes' worth of byte counts horizontally, we end up
> with the results of the PSADBW being laid out perfectly to concatenate and
> shrink in a single instruction with PACKUSWB. These all pipeline nicely with
> the PSADBW instructions resulting in even lower latency and better throughput
> than before.
> We're down to an insane 10.45 cycle block throughput for this code sequence
> compared to 13 for scalarized popcnt on Ivybridge. (12 vs. 13 on Haswell)
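
For readers following the thread, here is a rough portable C model of the v4i32 sequence described above. It is only a sketch, not the patch itself: the type and helper names (vec128, byte_popcounts, unpack_dq_zero, sad_zero, packus_wb, popcount_v4i32) are made up for illustration, and each helper emulates in scalar code what I understand the named instruction to do, assuming little-endian lane layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register as 16 little-endian bytes. */
typedef struct { uint8_t b[16]; } vec128;

/* Step 1: per-byte popcount from a 16-entry nibble table, which is the
 * lookup PSHUFB performs in-register (once for low nibbles, once for high). */
static vec128 byte_popcounts(vec128 v) {
    static const uint8_t lut[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
    vec128 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = (uint8_t)(lut[v.b[i] & 0x0F] + lut[v.b[i] >> 4]);
    return r;
}

/* Step 2: PUNPCKLDQ (high == 0) or PUNPCKHDQ (high != 0) against zero.
 * Each selected i32 lane lands in the low half of a 64-bit lane. */
static vec128 unpack_dq_zero(vec128 v, int high) {
    vec128 r;
    int base = high ? 8 : 0;
    memset(r.b, 0, sizeof r.b);
    memcpy(&r.b[0], &v.b[base], 4);
    memcpy(&r.b[8], &v.b[base + 4], 4);
    return r;
}

/* Step 3: PSADBW against zero. Sums each group of 8 bytes into the low
 * 16 bits of its 64-bit lane; the remaining bits are zero. */
static vec128 sad_zero(vec128 v) {
    vec128 r;
    memset(r.b, 0, sizeof r.b);
    for (int half = 0; half < 2; half++) {
        unsigned sum = 0;
        for (int i = 0; i < 8; i++)
            sum += v.b[half * 8 + i];
        r.b[half * 8]     = (uint8_t)(sum & 0xFF);
        r.b[half * 8 + 1] = (uint8_t)(sum >> 8);
    }
    return r;
}

/* Saturate a signed 16-bit word (given as raw bits) to an unsigned byte. */
static uint8_t sat_u8(uint16_t w) {
    if (w & 0x8000) return 0;      /* negative as int16 clamps to 0 */
    return w > 255 ? 255 : (uint8_t)w;
}

/* Step 4: PACKUSWB(lo, hi). Narrows the 8 words of lo into bytes 0..7 and
 * the 8 words of hi into bytes 8..15, concatenating both halves at once. */
static vec128 packus_wb(vec128 lo, vec128 hi) {
    vec128 r;
    for (int i = 0; i < 8; i++) {
        r.b[i]     = sat_u8((uint16_t)(lo.b[2 * i] | (lo.b[2 * i + 1] << 8)));
        r.b[8 + i] = sat_u8((uint16_t)(hi.b[2 * i] | (hi.b[2 * i + 1] << 8)));
    }
    return r;
}

/* Whole sequence: popcount of each of four 32-bit elements. */
static void popcount_v4i32(const uint32_t in[4], uint32_t out[4]) {
    vec128 v, cnt, lo, hi, res;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            v.b[4 * i + j] = (uint8_t)(in[i] >> (8 * j));
    cnt = byte_popcounts(v);
    lo  = sad_zero(unpack_dq_zero(cnt, 0));  /* counts for lanes 0 and 1 */
    hi  = sad_zero(unpack_dq_zero(cnt, 1));  /* counts for lanes 2 and 3 */
    res = packus_wb(lo, hi);
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)res.b[4 * i] | ((uint32_t)res.b[4 * i + 1] << 8) |
                 ((uint32_t)res.b[4 * i + 2] << 16) | ((uint32_t)res.b[4 * i + 3] << 24);
}
```

In this model each PSADBW result holds one count at the bottom of each 64-bit lane, so the pack step drops the four counts straight into the four i32 positions. That is the "laid out perfectly" property mentioned in the quoted text, at least as I read it.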