[PATCH] [x86] @llvm.ctpop.v8i32 custom lowering

Thu May 21 16:27:23 PDT 2015

Hi,

This patch implements a faster vector population count based on the algorithm
described in http://wm.ite.pl/articles/sse-popcount.html

It does so by using an in-register lookup table and the pshufb instruction to
compute the popcnt for each byte. Additional instructions are then used to sum
the bytes and produce the result for wider element types. Numbers (lower is
better):

v4i32-avx:
----------

sselookup (v4i32): 1.10211
scalar + ctpop (v4i32): 0.907016 <-- best == ToT
parallelbitmath (v4i32): 1.14124

v8i32-avx:
----------

sselookup (v8i32): 1.97514 <-- best == patch
scalar + ctpop (v8i32): 2.37118

v8i32-avx2:
-----------

sselookup (v8i32): 1.17823
parallelbitmath (v8i32): 1.15288 <-- best == ToT

v2i64-avx:
----------

scalar + ctpop (v2i64): 0.589292 <-- best == ToT
sselookup (v2i64): 0.865797
parallelbitmath (v2i64): 1.31027

v4i64-avx:
----------

scalar + ctpop (v4i64): 0.903523 <-- best == ToT
sselookup (v4i64): 1.11988

v4i64-avx2:
-----------

scalar + ctpop (v4i64): 0.895486
sselookup (v4i64): 0.677801 <-- best == patch
parallelbitmath (v4i64): 1.02711

v16i8-avx:
----------

scalar + ctpop (v16i8): 4.1569
sselookup (v16i8): 0.508693 <-- best == patch

v32i8-avx:
----------

scalar + ctpop (v32i8): 8.32336
sselookup (v32i8): 0.961657 <-- best == patch

v32i8-avx2:
-----------

scalar + ctpop (v32i8): 8.79509
sselookup (v32i8): 0.487716 <-- best == patch

v8i16-avx:
----------

scalar + ctpop (v8i16): 1.86908
sselookup (v8i16): 0.755885 <-- best == patch

v16i16-avx:
-----------

scalar + ctpop (v16i16): 4.08575
sselookup (v16i16): 1.32838 <-- best == patch

v16i16-avx2:
------------

scalar + ctpop (v16i16): 4.19101
sselookup (v16i16): 1.18095 <-- best == patch

More info available at
https://github.com/bcardosolopes/llvm-vpopcount
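
For readers unfamiliar with the lookup scheme described at the top of this
mail, here is a minimal scalar sketch in C. It is not the code in the patch:
it only models, one byte at a time, what the pshufb-based version does for all
bytes of a vector at once. The names NIBBLE_POPCNT, popcnt_bytes and
popcnt_v4i32 are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit counts for every 4-bit value. In the vector version this 16-byte
   table would live in a register and be indexed with pshufb. */
static const uint8_t NIBBLE_POPCNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Per-byte step: look up the low and high nibble of each byte and add
   the two counts (scalar stand-in for two pshufb lookups plus an add). */
static void popcnt_bytes(const uint8_t *in, uint8_t *out, size_t n) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        uint8_t lo = in[i] & 0x0F;
        uint8_t hi = in[i] >> 4;
        out[i] = (uint8_t)(NIBBLE_POPCNT[lo] + NIBBLE_POPCNT[hi]);
    }
}

/* Widening step for a v4i32-shaped input: sum the four byte counts that
   belong to each 32-bit element. */
static void popcnt_v4i32(const uint32_t in[4], uint32_t out[4]) {
    for (int e = 0; e < 4; ++e) {
        uint8_t bytes[4], counts[4];
        for (int b = 0; b < 4; ++b)
            bytes[b] = (uint8_t)(in[e] >> (8 * b));
        popcnt_bytes(bytes, counts, 4);
        out[e] = (uint32_t)counts[0] + counts[1] + counts[2] + counts[3];
    }
}
```

The byte-summing loop is where the "additional instructions" for wider element
types come in; how the patch performs that summation in vector registers is
not shown here.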

One unexpected case is v8i32-avx2. Which of sselookup and parallelbitmath runs
faster varies from run to run, but I've seen the latter yield slightly better
results across multiple runs. I would expect sselookup to always be faster
because it has fewer instructions, but it looks like there's some
latency/resource conflict issue going on.

Given the slight perf difference between sselookup and parallelbitmath for
v8i32-avx2, I've removed parallelbitmath completely in this patch and left
sselookup as the default for this type too. We can later change the behavior
for this type back to parallelbitmath (see the next paragraph).

This patch only improves the x86-specific part of vector popcnt. The previous
approach implemented for x86 in Dec 2014, parallelbitmath, is generally
inferior. Given its target-independent nature, it will be resubmitted in a
follow-up patch as a target-independent expansion for vector popcnt, since
(although no longer for x86) it's much better than the scalar expansion we
currently do.
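
As a point of comparison, the sketch below shows the classic mask-and-add
bit-parallel popcount on a single 32-bit value. This assumes that
"parallelbitmath" refers to this well-known technique; the mail itself does
not spell out the sequence, and the function name popcnt32_bitmath is
hypothetical. It uses only shifts, masks, adds and one multiply, with no
table and no target-specific instruction, which is what makes this style of
expansion target independent.

```c
#include <stdint.h>

/* Bit-parallel popcount of one 32-bit value (assumed illustration of the
   "parallelbitmath" idea, not code taken from the patch). */
static uint32_t popcnt32_bitmath(uint32_t v) {
    /* Each 2-bit field now holds the count of its two bits. */
    v = v - ((v >> 1) & 0x55555555u);
    /* Each 4-bit field now holds the count of its four bits. */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    /* Each byte now holds the count of its eight bits. */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    /* Sum the four byte counts into the top byte and extract it. */
    return (v * 0x01010101u) >> 24;
}
```

A vector expansion would apply the same sequence to every element in
parallel.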

REPOSITORY
  rL LLVM

http://reviews.llvm.org/D6531

Files:
  lib/Target/X86/X86ISelLowering.cpp
  lib/Target/X86/X86ISelLowering.h
  lib/Target/X86/X86InstrFragmentsSIMD.td
  lib/Target/X86/X86InstrSSE.td
  test/CodeGen/X86/avx-popcnt.ll
  test/CodeGen/X86/avx2-popcnt.ll
  test/CodeGen/X86/vector-ctpop.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D6531.26286.patch
Type: text/x-patch
Size: 48553 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150521/7395d075/attachment.bin>