[PATCH] [x86] @llvm.ctpop.v8i32 custom lowering

Chandler Carruth chandlerc at gmail.com
Sat Dec 6 17:10:34 PST 2014


On Thu, Dec 4, 2014 at 8:50 PM, Bruno Cardoso Lopes <bruno.cardoso at gmail.com
> wrote:

> Chandler,
>
> Thanks for the help, the assembly for v8i32-new/old:
> http://pastebin.com/4gnd41Je
>
> About the principled split: I'd rather go the other way around, i.e., since
> SelectionDAGLegalize::ExpandBitCount already emits the bit-math for the
> scalarized versions, it makes more sense to custom-split to other known
> vector types only when we already know it's profitable.
>

I'm fine with that, but we should know *why* those vector types are
profitable.

From looking at the assembly it is almost certainly the extract/insert
pattern that is so terribly slow here. Those instructions cause really
frustrating stalls in dense code.

From your timings, it looks like it at least makes sense to do this for
*all* vectors with 8 or more elements (including v8i16, v16i16, etc.). The
timings on 4 elements are really close though.

I think you can improve the timings for the avx1 code by interleaving the
operations so that they exploit ILP.

However, you can probably improve them much more by using the 16-entry
lookup-table-in-a-register trick with pshufb outlined here:
http://wm.ite.pl/articles/sse-popcount.html
For i64 element types, the psadbw accumulation trick works perfectly. For
smaller element types, you can shift, mask, and sum to widen from a byte
popcount to the desired width. For i32, that's only 2 iterations. All
combined, v8i32 will actually be the worst case, but it still looks an
instruction shorter (not sure if it will be faster of course). For i16 and
i8 vectors it should certainly be a huge win, and i64 also looks promising
thanks to psadbw.
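For reference, here is the LUT trick in scalar form; pshufb performs this nibble
table lookup on all 16 byte lanes of an xmm register at once, and the widening
loop below stands in for the shift/mask/add steps (or psadbw for i64). A rough
sketch only, with helper names of my own choosing:

```c
#include <stdint.h>

/* 16-entry table: popcount of each possible nibble value 0..15.
   With pshufb this table lives in one xmm register and the lookup
   happens for every byte lane in parallel. */
static const uint8_t nibble_popcnt[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Per-byte popcount via two nibble lookups -- the scalar analogue of
   applying pshufb to the low and high nibbles of each byte. */
static uint8_t byte_popcnt(uint8_t b) {
    return nibble_popcnt[b & 0x0F] + nibble_popcnt[b >> 4];
}

/* Widen byte popcounts to an i32 popcount by summing the four byte
   counts. In the vector lowering this is the shift/mask/add widening
   (two doubling iterations for i32); for i64 elements, psadbw sums
   all eight byte counts against zero in a single instruction. */
static uint32_t popcnt32_via_lut(uint32_t x) {
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += byte_popcnt((uint8_t)(x >> (8 * i)));
    return sum;
}
```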

I'm curious if that trick makes even the v4 cases profitable? Maybe even
the v2?


> Nadav and Hal,
>
> There are potential benefits for other targets, I believe, but this
> customisation generates a bunch of vector instructions, and I'm afraid that
> if one vector instruction or another isn't well supported on a target, that
> could lead to a lot of scalarized instructions and thus to worse code
> than before. I might be wrong though. I'd just rather go in the direction
> that if other targets implement it and succeed, we then move it to
> target-independent code. Additional thoughts?
>

I think the ideal strategy is for targets to select the types for which we
scalarize and use a scalar popcnt instruction vs. legalizing to vector bit
math. I'm not sure what the best strategy is for exposing such a hook. Note
that you probably will need additional hooks to use the pshufb trick in the
lowering -- implementing a 16-way lookup table via a shuffle may not work
out well on all targets. =] psadbw is also terribly specific.


> Actually, back to x86: if popcnt isn't supported by some x86 target, it
> currently leads to this scalarized bit-math code for each element, and it
> would always be profitable to emit the vectorized code instead -- tested it
> for v4i32, v2i64, v4i64 and v8i32, and it performs better. Gonna update the
> patch to reflect that. For instance, "-arch x86_64" doesn't assume popcnt by
> default, since it is a separate feature; in cases like this we would always
> win.
>

Yes, for *any* time where the scalar op isn't legal, this is vastly
superior lowering.
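For context, the scalar bit-math being compared against here is, as I
understand it, the classic parallel-bits expansion that the legalizer emits
per element; sketched for a single i32:

```c
#include <stdint.h>

/* The classic SWAR popcount expansion, emitted once per scalar element
   when no popcnt instruction is available. Extracting each lane from a
   vector, running this, and re-inserting the result is what makes the
   scalarized fallback so much worse than vector bit-math. */
static uint32_t expand_ctpop_i32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit sums   */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit sums   */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* byte sums    */
    return (v * 0x01010101u) >> 24;                   /* add bytes up */
}
```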


More information about the llvm-commits mailing list