[PATCH] [x86] @llvm.ctpop.v8i32 custom lowering

Thu Dec 11 04:29:33 PST 2014

Phabricator is taking forever to update the patch.
Just attached an updated version.

On Wed, Dec 10, 2014 at 3:06 PM, Bruno Cardoso Lopes
<bruno.cardoso at gmail.com> wrote:
> Hi Chandler,
>
> Thank you for the review.
>
>> I'm fine with that, but we should know *why* those vector types are
>> profitable.
>>
>> From looking at the assembly it is almost certainly the extract/insert
>> pattern that is so terribly slow here. Those instructions cause really
>> frustrating stalls in dense code.
>>
>> From your timings, it looks like it at least makes sense to do this for
>> *all* vectors with 8 or more elements (including v8i16, v16i16, etc.). The
>> timings on 4 elements are really close though.
>
> Agreed!
>
>> I think you can improve the timings for the avx1 code by interleaving the
>> operations so that they exploit ILP.
>
> Do you have in mind any specific operations?
>
>> However, you can probably improve them much more by using the 16-entry
>> lookup-table-in-a-register trick with pshufb outlined here:
>> http://wm.ite.pl/articles/sse-popcount.html
>> For i64 element types, the psadbw accumulation trick works perfectly. For
>> smaller element types, you can shift, mask, and sum to widen from a byte
>> popcount to the desired width. For i32, that's only 2 iterations. All
>> combined, v8i32 will actually be the worst case, but it still looks an
>> instruction shorter (not sure if it will be faster of course). For i16 and
>> i8 vectors it should certainly be a huge win, and i64 also looks promising
>> thanks to psadbw.
>>
>> I'm curious if that trick makes even the v4 cases profitable? Maybe even the
>> v2?
>
> I looked at this link before but haven't tried to implement/measure
> this path yet. There's a lot of potential here indeed; I'm willing to try
> these out in forthcoming patches.
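
For reference, a minimal scalar sketch in C of the lookup-table idea
discussed above, not code from the patch. It models per byte what pshufb
would do for 16 bytes at once, then widens with shift/mask/sum for i32 and
with a plain byte sum (the role psadbw against zero would play) for i64.
All names here (kNibblePopcnt, popcnt_byte_lut, popcnt_i32_lut,
popcnt_i64_lut) are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Popcount of every 4-bit value. In a vector version this 16-entry table
   would sit in one register and a byte shuffle would index into it. */
static const uint8_t kNibblePopcnt[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                          1, 2, 2, 3, 2, 3, 3, 4};

/* Per-byte popcount: look up the low and the high nibble, then add. */
static uint8_t popcnt_byte_lut(uint8_t b) {
    return (uint8_t)(kNibblePopcnt[b & 0x0F] + kNibblePopcnt[b >> 4]);
}

/* i32: build the four byte counts in place, then widen in two
   shift/mask/sum steps (bytes -> 16-bit halves -> 32-bit total). */
static uint32_t popcnt_i32_lut(uint32_t v) {
    uint32_t bytes = 0;
    for (int i = 0; i < 4; ++i)
        bytes |= (uint32_t)popcnt_byte_lut((uint8_t)(v >> (8 * i))) << (8 * i);
    uint32_t halves = (bytes & 0x00FF00FFu) + ((bytes >> 8) & 0x00FF00FFu);
    return (halves & 0x0000FFFFu) + (halves >> 16);
}

/* i64: sum the eight byte counts, which is what a sum of absolute
   differences against an all-zero vector would produce. */
static uint64_t popcnt_i64_lut(uint64_t v) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += popcnt_byte_lut((uint8_t)(v >> (8 * i)));
    return sum;
}
```

This only illustrates the data flow; whether the vector form is actually
faster still has to be measured.
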
>
>> I think the ideal strategy is for targets to select the types for which we
>> scalarize and use a scalar popcnt instruction vs. legalizing to vector bit
>> math. I'm not sure what the best strategy is for exposing such a hook. Note
>> that you probably will need additional hooks to use the pshufb trick in the
>> lowering -- implementing a 16-way lookup table via a shuffle may not work
>> out well on all targets. =] psadbw is also terribly specific.
>
> Yep, these are the current options I see:
>
> 1) Scalarize completely: it may be faster than (3) for some targets
> (avoids further legalization + extra inserts/extracts)?
> 2) Scalarize + popcnt instructions.
> 3) General vector bit-math: we can use the bitcount strategy my patch
> currently addresses for x86
> 4) Custom lower: for the target specific cases (psadbw, ...)
>
> 1 and 2 are already handled by the vector type legalizer. We also need to be
> able to tell the legalizer to decide between 1 and 3; I guess we can
> try to check whether the target has legal vector types for add, shift,
> and, and sub (or a subset of those). For any other cases 4 should be
> fine.
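
To make option 3 concrete, here is a sketch in C of the classic parallel
bit count, which needs only sub, and, shift, and add. It is the well-known
textbook sequence and is not claimed to match the patch instruction for
instruction. The function names (popcnt_i32_bitmath, ctpop_v8i32_bitmath)
are hypothetical, and the loop over eight lanes only stands in for a v8i32
vector in which every lane would go through the same operations.

```c
#include <stdint.h>

/* Bit count of one 32-bit lane using only sub, and, shift, and add. */
static uint32_t popcnt_i32_bitmath(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit sums  */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit sums  */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    v = v + (v >> 8);                                 /* 16-bit sums */
    v = v + (v >> 16);                                /* 32-bit sum  */
    return v & 0x3Fu;                                 /* result is 0..32 */
}

/* Scalar stand-in for a v8i32 ctpop: same operation on each lane. */
static void ctpop_v8i32_bitmath(const uint32_t in[8], uint32_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = popcnt_i32_bitmath(in[i]);
}
```

Since every step is a plain lane-wise operation, a check of the kind
described above (are vector add, shift, and, and sub legal for the type?)
would be enough to tell whether this form can be emitted without
scalarizing.
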
>
>> Yes, for *any* time where the scalar op isn't legal, this is vastly superior
>> lowering.
>
> If it isn't a problem, I'd rather do this in incremental
> steps, i.e., my current plan is to first introduce x86 support for
> v8i32 (with and without popcnt available) and for other i32/i64 vector types
> when popcnt isn't available. In a next step I'll tackle i8/i64 and
> so on. Let me know whether you're ok with this approach. Will update
> it once phabricator is back.
>
> --
> Bruno Cardoso Lopes
> http://www.brunocardoso.cc

-- 
Bruno Cardoso Lopes
http://www.brunocardoso.cc
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ctpop-3.patch
Type: application/octet-stream
Size: 1084857 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20141211/69095091/attachment.obj>