[PATCH] [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.

I believe the same approach would work on ARM64, which also has byte-wise vector popcounts and can do interleave-with-zero. Do you think it would be worthwhile to find a way to share the core of this approach?


> Update this with an even better algorithm that Fiora came up with when we were
> discussing this in IRC.
> By using PUNPCKLDQ and PUNPCKHDQ to interleave the i32 elements with zeros so
> that we can use PSADBW to sum 8 bytes worth of bytes horizontally, we end up
> with the results of the PSADBW being laid out perfectly to concatenate and
> shrink in a single instruction with PACKUSWB. These all pipeline nicely with
> the PSADBW instructions resulting in even lower latency and better throughput
> than before.
> We're down to an insane 10.45 cycle block throughput for this code sequence
> compared to 13 for scalarized popcnt on Ivybridge. (12 vs. 13 on Haswell)
> http://reviews.llvm.org/D10084
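
For reference, this is how I read the data flow described above for a v4i32 popcount, written as a portable scalar C model rather than real intrinsics or the patch's lowering code. The vec128 type and every function name here are made up for illustration, and the per-byte step assumes the usual two-nibble-lookup form of the PSHUFB LUT; the actual patch may order or combine things differently.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of one 128-bit register: 16 bytes, byte 0 lowest. */
typedef struct { uint8_t b[16]; } vec128;

/* Per-byte popcount: models two PSHUFB lookups into a 16-entry nibble
 * table (low nibble, high nibble) followed by a byte-wise add. */
static vec128 popcnt_bytes(vec128 v) {
    static const uint8_t lut[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
    vec128 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = (uint8_t)(lut[v.b[i] & 0x0F] + lut[v.b[i] >> 4]);
    return r;
}

/* Models PUNPCKLDQ (high == 0) or PUNPCKHDQ (high == 1) against zero:
 * each selected i32 lane ends up alone in a 64-bit half, zero-extended. */
static vec128 unpack_dq_zero(vec128 v, int high) {
    vec128 r;
    int base = high ? 8 : 0;
    memset(&r, 0, sizeof r);
    memcpy(&r.b[0], &v.b[base], 4);
    memcpy(&r.b[8], &v.b[base + 4], 4);
    return r;
}

/* Models PSADBW against zero: each group of 8 bytes is summed into the
 * low 16 bits of its 64-bit half; the remaining bits are zero. */
static vec128 psadbw_zero(vec128 v) {
    vec128 r;
    memset(&r, 0, sizeof r);
    for (int h = 0; h < 2; h++) {
        unsigned sum = 0;
        for (int i = 0; i < 8; i++)
            sum += v.b[8 * h + i];
        r.b[8 * h] = (uint8_t)(sum & 0xFF);
        r.b[8 * h + 1] = (uint8_t)(sum >> 8);
    }
    return r;
}

/* Saturate a signed 16-bit word to an unsigned byte, as PACKUSWB does. */
static uint8_t sat_u8(int16_t w) {
    return (uint8_t)(w < 0 ? 0 : (w > 255 ? 255 : w));
}

/* Models PACKUSWB(lo, hi): the 8 words of lo, then the 8 words of hi. */
static vec128 packuswb(vec128 lo, vec128 hi) {
    vec128 r;
    for (int i = 0; i < 8; i++) {
        uint16_t a = (uint16_t)(lo.b[2 * i] | (lo.b[2 * i + 1] << 8));
        uint16_t c = (uint16_t)(hi.b[2 * i] | (hi.b[2 * i + 1] << 8));
        r.b[i] = sat_u8((int16_t)a);
        r.b[8 + i] = sat_u8((int16_t)c);
    }
    return r;
}

/* Whole sequence for four 32-bit lanes. */
void popcnt_v4i32_model(const uint32_t in[4], uint32_t out[4]) {
    vec128 v, lo, hi, r;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            v.b[4 * i + j] = (uint8_t)(in[i] >> (8 * j));

    v = popcnt_bytes(v);
    lo = psadbw_zero(unpack_dq_zero(v, 0)); /* counts for lanes 0 and 1 */
    hi = psadbw_zero(unpack_dq_zero(v, 1)); /* counts for lanes 2 and 3 */
    r = packuswb(lo, hi); /* words [c,0,0,0] pack to bytes [c,0,0,0] */

    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)r.b[4 * i] | ((uint32_t)r.b[4 * i + 1] << 8) |
                 ((uint32_t)r.b[4 * i + 2] << 16) | ((uint32_t)r.b[4 * i + 3] << 24);
}
```

The point of the model is the last step: after PSADBW each 64-bit half holds one small count followed by three zero words, so packing the two results to bytes lands each count in the low byte of its own 32-bit lane with no extra shuffle. Only the first function is x86-specific in spirit, which is why I suspect the rest could be shared with a target that has a native byte popcount.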
