[PATCH] [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.
chandlerc at gmail.com
Thu May 28 02:47:57 PDT 2015
Updated this with an even better algorithm that Fiora came up with while we
were discussing this on IRC.
By using PUNPCKLDQ and PUNPCKHDQ to interleave the i32 elements with zeros, we
can use PSADBW to horizontally sum each 8-byte group of per-byte counts. The
PSADBW results then end up laid out perfectly to concatenate and shrink in a
single instruction with PACKUSWB. The unpack and pack instructions all pipeline
nicely with the PSADBW instructions, resulting in even lower latency and better
throughput.
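
For readers following along without the patch, here is a rough sketch of the
sequence for a single v4i32 vector, written with SSE intrinsics rather than as
the actual LLVM lowering code. The function names popcnt_v4i32 and popcnt_u32x4
are made up for illustration, and the per-byte step assumes the usual PSHUFB
nibble-LUT approach named in the subject line.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: PUNPCK*DQ, PSADBW, PACKUSWB */
#include <tmmintrin.h>  /* SSSE3: PSHUFB */

/* Hypothetical helper: popcount of each 32-bit lane of v.
 * The target attribute (GCC/Clang) enables SSSE3 for this function only. */
__attribute__((target("ssse3")))
static __m128i popcnt_v4i32(__m128i v) {
    /* In-register LUT: popcount of every 4-bit value 0..15. */
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i nibble_mask = _mm_set1_epi8(0x0F);
    const __m128i zero = _mm_setzero_si128();

    /* Split each byte into nibbles and look both up with PSHUFB. */
    __m128i lo_nib = _mm_and_si128(v, nibble_mask);
    __m128i hi_nib = _mm_and_si128(_mm_srli_epi16(v, 4), nibble_mask);
    __m128i byte_counts = _mm_add_epi8(_mm_shuffle_epi8(lut, lo_nib),
                                       _mm_shuffle_epi8(lut, hi_nib));

    /* PUNPCKLDQ / PUNPCKHDQ: interleave the i32 elements with zeros,
     * giving [e0, 0, e1, 0] and [e2, 0, e3, 0]. */
    __m128i lo_pair = _mm_unpacklo_epi32(byte_counts, zero);
    __m128i hi_pair = _mm_unpackhi_epi32(byte_counts, zero);

    /* PSADBW against zero sums each 8-byte half, so every 64-bit lane
     * now holds the bit count of one original i32 element. */
    __m128i lo_sums = _mm_sad_epu8(lo_pair, zero);
    __m128i hi_sums = _mm_sad_epu8(hi_pair, zero);

    /* PACKUSWB narrows 16-bit words to bytes and concatenates the two
     * inputs, which turns the four 64-bit sums into four 32-bit sums.
     * Counts never exceed 32, so the saturation is never triggered. */
    return _mm_packus_epi16(lo_sums, hi_sums);
}

/* Hypothetical convenience wrapper over plain arrays. */
static void popcnt_u32x4(const uint32_t in[4], uint32_t out[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, popcnt_v4i32(v));
}
```

Calling popcnt_u32x4 on {0, 0xFFFFFFFF, 0x80000001, 0x12345678} should store
{0, 32, 2, 13}. This only builds for x86 targets and needs a CPU with SSSE3.
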
We're down to an insane 10.45-cycle block throughput for this code sequence,
compared to 13 cycles for scalarized popcnt on Ivy Bridge (12 vs. 13 on
Haswell).
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 154924 bytes
Desc: not available