[PATCH] [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.

Bruno Cardoso Lopes bruno.cardoso at gmail.com
Fri May 29 11:15:58 PDT 2015


> I wasn’t actually suggesting not using ARM64’s native vXi8 instructions, but rather the rest of the sequence that synthesizes wider lane pop counts on top of it.

I see, but on x86 even the *best* remainder of the sequence differs
per vector element type, and that may well be true for ARM64 too (I'm
not well versed in ARM's vector instructions).
We could definitely do something like this in the vector legalizer
ExpandPopCount:

if (target's vXi8 CTPOP action != Expand) *and*
   TLI.is-cheap-to-interleave-with-zero-or-whatever-name(...)
   Ctpop = getNode(CTPOP, vXi8...)
   ....
   <target independent vector ops>
....
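
To make the "<target independent vector ops>" part concrete, here is a
rough scalar C++ sketch of the idea, not the code from the patch. It
assumes a 4 x i32 vector modelled as a std::array; ctpopBytes is a
hypothetical stand-in for a native vXi8 CTPOP, and the widening uses
plain shift/add/mask steps rather than interleaves. All names here are
made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of a 4 x i32 vector register.
using V4i32 = std::array<uint32_t, 4>;

// Stand-in for a native vXi8 CTPOP: each byte of each lane is replaced
// by the number of set bits in that byte.
V4i32 ctpopBytes(V4i32 v) {
  for (uint32_t &lane : v) {
    uint32_t out = 0;
    for (int b = 0; b < 4; ++b) {
      uint8_t byte = static_cast<uint8_t>((lane >> (8 * b)) & 0xFF);
      uint32_t count = 0;
      for (; byte; byte &= byte - 1) // clear the lowest set bit each round
        ++count;
      out |= count << (8 * b);
    }
    lane = out;
  }
  return v;
}

// Generic widening on top of the byte counts: sum the four per-byte
// counts inside each 32-bit lane using only shift, add and mask.
V4i32 ctpopV4i32(V4i32 v) {
  V4i32 r = ctpopBytes(v);
  for (uint32_t &lane : r) {
    lane = (lane & 0x00FF00FFu) + ((lane >> 8) & 0x00FF00FFu); // i8 -> i16
    lane = (lane & 0x0000FFFFu) + (lane >> 16);                // i16 -> i32
    assert(lane <= 32 && "a 32-bit lane has at most 32 set bits");
  }
  return r;
}
```

Whether shift/add/mask or an interleave-with-zero sequence is the
faster way to do the widening is exactly the per-target question; the
sketch only shows that the part after the byte-wise count can be
expressed with generic operations.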

I think it would be worthwhile to find a way to share the core of this
approach, but I'm curious whether the fastest paths on x86 and ARM64
actually use the same kind of vector instructions (i.e. interleaves)
to build on top of it.

-- 
Bruno Cardoso Lopes
http://www.brunocardoso.cc



