[PATCH] [x86] @llvm.ctpop.v8i32 custom lowering

Thu Dec 4 08:36:58 PST 2014

LGTM!  Thank you for the detailed measurements!  

Do you think that other targets may also benefit from this kind of transformation? 

Thanks,
Nadav

> On Dec 4, 2014, at 8:23 AM, Bruno Cardoso Lopes <bruno.cardoso at gmail.com> wrote:
> 
> Hi nadav, chandlerc, andreadb, delena,
> 
> This patch adds x86 custom lowering for the @llvm.ctpop.v8i32 intrinsic.
> 
> Currently, @llvm.ctpop.v8i32 is expanded into vector element extractions,
> insertions, and individual calls to @llvm.ctpop.i32. Local Haswell measurements
> show that @llvm.ctpop.v8i32 is faster with a vector-parallel bit-twiddling approach
> than with a separate @llvm.ctpop.i32 call for each element. The approach is based on:
> 
> v = v - ((v >> 1) & 0x55555555);
> v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
> v = (v + (v >> 4)) & 0x0F0F0F0F;
> v = v + (v >> 8);
> v = v + (v >> 16);
> v = v & 0x0000003F;
> (from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
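
For reference, the sequence quoted above can be written out as a small
self-contained C sketch. This is only an illustration of the bit-counting
steps on scalars, not the code in the patch: the names popcount32_swar and
ctpop_v8i32_sketch are hypothetical, and the 8-element loop merely stands in
for what a vector lowering would do to all lanes with one instruction per step.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): population count of one
 * 32-bit value using the parallel bit-counting steps quoted above. */
static uint32_t popcount32_swar(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* counts per 2-bit field */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* counts per 4-bit field */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* counts per byte (0..8) */
    v = v + (v >> 8);                                 /* fold adjacent bytes */
    v = v + (v >> 16);                                /* low byte now holds the total */
    return v & 0x0000003Fu;                           /* total is at most 32 */
}

/* Hypothetical scalar stand-in for an 8 x i32 vector: every lane goes
 * through the same shift/mask/add steps, which is what makes the
 * sequence a candidate for vector instructions. */
static void ctpop_v8i32_sketch(const uint32_t in[8], uint32_t out[8])
{
    for (size_t i = 0; i < 8; ++i)
        out[i] = popcount32_swar(in[i]);
}
```

Calling ctpop_v8i32_sketch on an array of eight 32-bit values fills the output
array with the per-element bit counts, one per lane.
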
> 
> A toy microbenchmark showed a ~2x speedup for v8i32, whereas vector types with fewer elements
> are still faster with the old approach (see results below). Hence this
> patch only implements it for the v8i32 type. The results suggest it might also be profitable
> to implement this approach for v32i8 and v16i16, but I haven't measured that yet.
> 
> With AVX1, ctpop.v8i32 is split into two ctpop.v4i32 operations, which is only slightly better than the old expansion. However,
> this patch does not implement custom lowering for the general ctpop.v4i32 type, since that is not profitable.
> 
> == [core-avx2]
> v8i32-new: 10.3506
> v8i32-old: 18.3879
> v4i32-new: 10.3699
> v4i32-old: 8.01387
> v4i64-new: 11.7464
> v4i64-old: 10.3043
> v2i64-new: 11.7922
> v2i64-old: 5.20916
> 
> == [corei7-avx]
> v8i32-new: 16.5359
> v8i32-old: 18.2479
> v4i32-new: 10.2069
> v4i32-old: 8.03686
> v4i64-new: 17.8085
> v4i64-old: 10.2366
> v2i64-new: 11.7623
> v2i64-old: 5.11533
> 
> http://reviews.llvm.org/D6531
> 
> Files:
>  lib/Target/X86/X86ISelLowering.cpp
>  test/CodeGen/X86/vector-ctpop.ll
> <D6531.16929.patch>