[PATCH] [x86] @llvm.ctpop.v8i32 custom lowering
Bruno Cardoso Lopes
bruno.cardoso at gmail.com
Thu May 21 16:27:23 PDT 2015
Hi,
This patch implements a faster vector population count based on the algorithm
described in http://wm.ite.pl/articles/sse-popcount.html
It does so by using an in-register lookup table and the pshufb instruction to
compute the popcnt for each byte. Additional instructions are then used to sum
the bytes and produce the result for wider element types. Numbers (lower is
better):
v4i32-avx:
----------
sselookup (v4i32): 1.10211
scalar + ctpop (v4i32): 0.907016 <-- best == ToT
parallelbitmath (v4i32): 1.14124
v8i32-avx:
----------
sselookup (v8i32): 1.97514 <-- best == patch
scalar + ctpop (v8i32): 2.37118
v8i32-avx2:
-----------
sselookup (v8i32): 1.17823
parallelbitmath (v8i32): 1.15288 <-- best == ToT
v2i64-avx:
----------
scalar + ctpop (v2i64): 0.589292 <-- best == ToT
sselookup (v2i64): 0.865797
parallelbitmath (v2i64): 1.31027
v4i64-avx:
----------
scalar + ctpop (v4i64): 0.903523 <-- best == ToT
sselookup (v4i64): 1.11988
v4i64-avx2:
-----------
scalar + ctpop (v4i64): 0.895486
sselookup (v4i64): 0.677801 <-- best == patch
parallelbitmath (v4i64): 1.02711
v16i8-avx:
----------
scalar + ctpop (v16i8): 4.1569
sselookup (v16i8): 0.508693 <-- best == patch
v32i8-avx:
----------
scalar + ctpop (v32i8): 8.32336
sselookup (v32i8): 0.961657 <-- best == patch
v32i8-avx2:
-----------
scalar + ctpop (v32i8): 8.79509
sselookup (v32i8): 0.487716 <-- best == patch
v8i16-avx:
----------
scalar + ctpop (v8i16): 1.86908
sselookup (v8i16): 0.755885 <-- best == patch
v16i16-avx:
-----------
scalar + ctpop (v16i16): 4.08575
sselookup (v16i16): 1.32838 <-- best == patch
v16i16-avx2:
------------
scalar + ctpop (v16i16): 4.19101
sselookup (v16i16): 1.18095 <-- best == patch
More info available at
https://github.com/bcardosolopes/llvm-vpopcount
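For readers unfamiliar with the lookup approach, here is a rough scalar model
of the idea, not the code in the patch. It assumes a 16-entry table holding the
popcount of every 4-bit value, which is what would sit in a register and be
indexed by pshufb. The names kNibblePopcnt, lookup, popcntBytes and
sumBytesToI32 are made up for this illustration, and the byte summation shown
is only one possible way to widen to v4i32.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Popcount of every 4-bit value. In the vector version this table lives in
// a register and pshufb performs 16 of these lookups at once.
constexpr std::array<std::uint8_t, 16> kNibblePopcnt = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Scalar stand-in for a single pshufb lane (hypothetical helper).
// Only indices 0..15 are meaningful here.
inline std::uint8_t lookup(std::uint8_t index) {
  assert(index < 16);
  return kNibblePopcnt[index];
}

// Per-byte popcount of a 16-byte "register": split each byte into its low
// and high nibble, look both up, and add the two results.
std::array<std::uint8_t, 16>
popcntBytes(const std::array<std::uint8_t, 16> &v) {
  std::array<std::uint8_t, 16> out{};
  for (std::size_t i = 0; i < v.size(); ++i) {
    const std::uint8_t lo = v[i] & 0x0F;
    const std::uint8_t hi = v[i] >> 4;
    out[i] = static_cast<std::uint8_t>(lookup(lo) + lookup(hi));
  }
  return out;
}

// Widening step for a v4i32-style result: add the four byte counts that
// belong to each 32-bit element.
std::array<std::uint32_t, 4>
sumBytesToI32(const std::array<std::uint8_t, 16> &counts) {
  std::array<std::uint32_t, 4> out{};
  for (std::size_t i = 0; i < counts.size(); ++i)
    out[i / 4] += counts[i];
  return out;
}
```

For v16i8 only popcntBytes is needed, which is consistent with the byte types
showing the largest gains in the numbers above.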
One unexpected case is v8i32-avx2. Although which of sselookup and
parallelbitmath runs faster varies from run to run, I've seen the latter yield
slightly better results across multiple runs. I would expect sselookup to
always be faster because it has fewer instructions, but it looks like there's
some latency/resource conflict issue going on.
Given the slight perf difference between sselookup and parallelbitmath for
v8i32-avx2, I've removed parallelbitmath completely in this patch and left
sselookup as the default for this type too. We can later change the behavior
for this type back to parallelbitmath (see the next paragraph).
This patch only improves the x86-specific part of vector popcnt. The previous
approach implemented for x86 in Dec 2014, parallelbitmath, is generally
inferior. Given its target-independent nature, it will be resubmitted in a
follow-up patch as a target-independent expansion for vector popcnt, since
(although no longer for x86) it's much better than the scalar expansion we
currently do.
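To show why parallelbitmath is target independent, here is a minimal sketch.
It assumes the name refers to the classic mask, shift and add bit-counting
sequence; the exact sequence used by the Dec 2014 lowering may differ. The
function names popcntBitMath and popcntV8I32 are hypothetical. Every step is a
plain shift, AND, add, subtract or multiply, so the same sequence can be
applied to all elements of a vector at once on any target that has those
vector operations.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Bit-math popcount of one 32-bit element (assumed form of the algorithm).
std::uint32_t popcntBitMath(std::uint32_t x) {
  // Each 2-bit field now holds the count of its two original bits.
  x = x - ((x >> 1) & 0x55555555u);
  // Each 4-bit field holds the count of its four original bits.
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  // Each byte holds the count of its eight original bits.
  x = (x + (x >> 4)) & 0x0F0F0F0Fu;
  // The multiply accumulates the four byte counts into the top byte.
  return (x * 0x01010101u) >> 24;
}

// Scalar model of a v8i32 popcount: the same sequence applied to every
// element. A vector expansion would run all eight elements in parallel.
std::array<std::uint32_t, 8>
popcntV8I32(const std::array<std::uint32_t, 8> &v) {
  std::array<std::uint32_t, 8> out{};
  for (std::size_t i = 0; i < v.size(); ++i)
    out[i] = popcntBitMath(v[i]);
  return out;
}
```

Unlike the lookup approach, this needs no byte shuffle instruction, which is
why it is a reasonable candidate for a generic expansion.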
REPOSITORY
rL LLVM
http://reviews.llvm.org/D6531
Files:
lib/Target/X86/X86ISelLowering.cpp
lib/Target/X86/X86ISelLowering.h
lib/Target/X86/X86InstrFragmentsSIMD.td
lib/Target/X86/X86InstrSSE.td
test/CodeGen/X86/avx-popcnt.ll
test/CodeGen/X86/avx2-popcnt.ll
test/CodeGen/X86/vector-ctpop.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D6531.26286.patch
Type: text/x-patch
Size: 48553 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150521/7395d075/attachment.bin>