[PATCH] [x86] @llvm.ctpop.v8i32 custom lowering

Wed Dec 10 09:06:44 PST 2014

Hi Chandler,

Thank you for the review.

> I'm fine with that, but we should know *why* those vector types are
> profitable.
>
> From looking at the assembly it is almost certainly the extract/insert
> pattern that is so terribly slow here. Those instructions cause really
> frustrating stalls in dense code.
>
> From your timings, it looks like it at least makes sense to do this for
> *all* vectors with 8 or more elements (including v8i16, v16i16, etc.). The
> timings on 4 elements are really close though.

Agreed!

> I think you can improve the timings for the avx1 code by interleaving the
> operations so that they exploit ILP.

Do you have any specific operations in mind?

> However, you can probably improve them much more by using the 16-entry
> lookup-table-in-a-register trick with pshufb outlined here:
> http://wm.ite.pl/articles/sse-popcount.html
> For i64 element types, the psadbw accumulation trick works perfectly. For
> smaller element types, you can shift, mask, and sum to widen from a byte
> popcount to the desired width. For i32, that's only 2 iterations. All
> combined, v8i32 will actually be the worst case, but it still looks an
> instruction shorter (not sure if it will be faster of course). For i16 and
> i8 vectors it should certainly be a huge win, and i64 also looks promising
> thanks to psadbw.
>
> I'm curious if that trick makes even the v4 cases profitable? Maybe even the
> v2?

I looked at this link before but haven't tried to implement or measure
this path yet. There's a lot of potential here indeed, and I'm willing
to try these out in forthcoming patches.
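
For reference, here is a minimal scalar sketch of the lookup-table idea
from that link, as I understand it. It is not code from any patch, and
all names in it (kNibblePopcnt, popcount8_lut, byte_counts32,
popcount32_lut) are made up for illustration. On x86 the 16-entry table
would sit in a vector register and pshufb would do the lookups for all
bytes at once; the plain array below only models that. The last function
shows the two shift/mask/sum iterations needed to widen byte counts to
an i32 count.

```c
#include <stdint.h>

/* Popcount of every 4-bit value. With pshufb, this table would be held
 * in a single 128-bit register (assumption: scalar model only). */
static const uint8_t kNibblePopcnt[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Popcount of one byte: look up the low and high nibbles and add. */
static uint8_t popcount8_lut(uint8_t b) {
    return (uint8_t)(kNibblePopcnt[b & 0x0F] + kNibblePopcnt[b >> 4]);
}

/* Replace each byte of a 32-bit lane with that byte's popcount. */
static uint32_t byte_counts32(uint32_t v) {
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint8_t byte = (uint8_t)(v >> (8 * i));
        r |= (uint32_t)popcount8_lut(byte) << (8 * i);
    }
    return r;
}

/* Widen the per-byte counts to a single i32 count in two iterations. */
static uint32_t popcount32_lut(uint32_t v) {
    uint32_t c = byte_counts32(v);
    c = (c & 0x00FF00FFu) + ((c >> 8) & 0x00FF00FFu); /* bytes -> i16 */
    c = (c & 0x0000FFFFu) + (c >> 16);                /* i16 -> i32 */
    return c;
}
```

For i64 elements the widening step would instead be a single psadbw
against zero, which this sketch does not model.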

> I think the ideal strategy is for targets to select the types for which we
> scalarize and use a scalar popcnt instruction vs. legalizing to vector bit
> math. I'm not sure what the best strategy is for exposing such a hook. Note
> that you probably will need additional hooks to use the pshufb trick in the
> lowering -- implementing a 16-way lookup table via a shuffle may not work
> out well on all targets. =] psadbw is also terribly specific.

Yep, these are the current options I see:

1) Scalarize completely: it may be faster than (3) for some targets
(avoids further legalization + extra inserts/extracts)?
2) Scalarize + popcnt instructions.
3) General vector bit-math: we can use the bitcount strategy my patch
currently implements for x86
4) Custom lower: for the target specific cases (psadbw, ...)

1 and 2 are already handled by the vector type legalizer. We also need
to be able to tell the legalizer how to decide between 1 and 3; I guess
we can try to check whether the target has legal vector types for add,
shift, and, and sub (or a subset of those). For any other cases, 4
should be fine.
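
To make it concrete why add, shift, and, and sub are the operations to
check for, here is a rough sketch of the well-known parallel bit-count
on one 32-bit lane, written in plain C. It is only an illustration and
not necessarily the exact sequence the patch emits; the function names
are hypothetical. Every step maps to a vector op that applies to all
lanes at once, which ctpop_v8i32_model imitates with a loop.

```c
#include <stdint.h>

/* Popcount of one 32-bit lane using only sub, and, shift and add. */
static uint32_t popcount32_bitmath(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit sums */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit sums */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* per-byte sums */
    v = v + (v >> 8);                                 /* fold bytes */
    v = v + (v >> 16);                                /* fold halves */
    return v & 0x3Fu;                                 /* count is 0..32 */
}

/* Scalar model of a v8i32 ctpop: the same steps applied to 8 lanes. */
static void ctpop_v8i32_model(const uint32_t in[8], uint32_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        out[i] = popcount32_bitmath(in[i]);
    }
}
```

No extract/insert is needed anywhere in this sequence, so a target with
legal vector add/sub/and/shift can keep the whole computation in vector
registers.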

> Yes, for *any* time where the scalar op isn't legal, this is vastly superior
> lowering.

If that isn't a problem, I'd rather do this in incremental steps, i.e.,
my current plan is to first introduce x86 support for v8i32 (with and
without popcnt available) and for other i32/i64 vector types when
popcount isn't available. In a next step I'll tackle i8/i64 and so on.
Let me know whether you're OK with this approach. I'll update the patch
once Phabricator is back.

-- 
Bruno Cardoso Lopes
http://www.brunocardoso.cc