[PATCH] [x86] @llvm.ctpop.v8i32 custom lowering

Thu Dec 4 08:48:25 PST 2014

----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: reviews+D6531+public+eee6cba39a572125 at reviews.llvm.org
> Cc: llvm-commits at cs.uiuc.edu
> Sent: Thursday, December 4, 2014 10:36:58 AM
> Subject: Re: [PATCH] [x86] @llvm.ctpop.v8i32 custom lowering
> 
> LGTM!  Thank you for the detailed measurements!
> 
> Do you think that other targets may also benefit from this kind of
> transformation?

I'm fairly certain that PowerPC would also benefit (but I've not tried it). Putting this in DAGCombine would likely make sense; there doesn't seem to be anything target-specific here. Can you post the toy benchmark you used?

Thanks again,
Hal

> 
> Thanks,
> Nadav
> 
> > On Dec 4, 2014, at 8:23 AM, Bruno Cardoso Lopes
> > <bruno.cardoso at gmail.com> wrote:
> > 
> > Hi nadav, chandlerc, andreadb, delena,
> > 
> > This patch adds x86 custom lowering for the @llvm.ctpop.v8i32
> > intrinsic.
> > 
> > > Currently, the expansion of @llvm.ctpop.v8i32 uses vector element
> > > extractions, insertions, and individual calls to @llvm.ctpop.i32.
> > > Local Haswell measurements show that @llvm.ctpop.v8i32 is faster
> > > with a vector-parallel bit-twiddling approach than with a separate
> > > @llvm.ctpop.i32 call for each element. The approach is based on:
> > 
> > v = v - ((v >> 1) & 0x55555555);
> > v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
> > > v = (v + (v >> 4)) & 0x0F0F0F0F;
> > > v = v + (v >> 8);
> > > v = v + (v >> 16);
> > > v = v & 0x0000003F;
> > (from
> > http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
> > 
> > > A toy microbenchmark showed a ~2x speedup for v8i32, whereas
> > > vector types with fewer elements are still better off with the
> > > old approach (see results below). Hence this patch only
> > > implements it for the v8i32 type. The results indicate it might
> > > also be profitable to implement this approach for v32i8 and
> > > v16i16, but I haven't measured that yet.
> > 
> > > With AVX1, ctpop.v8i32 is split into two ctpop.v4i32 operations,
> > > which is only slightly better than the old expansion. However,
> > > this patch does not implement custom lowering for the general
> > > ctpop.v4i32 type, since it's not profitable.
> > 
> > == [core-avx2]
> > v8i32-new: 10.3506
> > v8i32-old: 18.3879
> > v4i32-new: 10.3699
> > v4i32-old: 8.01387
> > v4i64-new: 11.7464
> > v4i64-old: 10.3043
> > v2i64-new: 11.7922
> > v2i64-old: 5.20916
> > 
> > == [corei7-avx]
> > v8i32-new: 16.5359
> > v8i32-old: 18.2479
> > v4i32-new: 10.2069
> > v4i32-old: 8.03686
> > v4i64-new: 17.8085
> > v4i64-old: 10.2366
> > v2i64-new: 11.7623
> > v2i64-old: 5.11533
> > 
> > http://reviews.llvm.org/D6531
> > 
> > Files:
> >  lib/Target/X86/X86ISelLowering.cpp
> >  test/CodeGen/X86/vector-ctpop.ll
> > <D6531.16929.patch>
> 
> 
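
For readers following the thread, the bit-twiddling sequence quoted above can be sketched in plain C as shown below. This is only an illustration under stated assumptions, not the code from D6531: the function names (popcount32_parallel, popcount32_naive, popcount_v8i32, check_lane) are hypothetical, and the scalar loop over eight lanes merely emulates what a vector lowering would compute for all lanes at once.

```c
#include <assert.h>
#include <stdint.h>

/* Parallel bit count for a single 32-bit lane, following the steps
 * quoted in the message (CountBitsSetParallel from the bithacks page). */
static uint32_t popcount32_parallel(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit partial sums */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit partial sums */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* per-byte sums      */
    v = v + (v >> 8);                                 /* fold bytes ...     */
    v = v + (v >> 16);                                /* ... into low byte  */
    return v & 0x0000003Fu;                           /* total is 0..32     */
}

/* Reference implementation: count one bit at a time. */
static uint32_t popcount32_naive(uint32_t v) {
    uint32_t n = 0;
    while (v != 0) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* Hypothetical stand-in for a v8i32 ctpop: the same lane-wise operation
 * applied to eight independent 32-bit elements. */
static void popcount_v8i32(const uint32_t in[8], uint32_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = popcount32_parallel(in[i]);
}

/* Sanity check: both implementations must agree on the given value. */
static void check_lane(uint32_t v) {
    assert(popcount32_parallel(v) == popcount32_naive(v));
}
```

Every step uses only shifts, masks, adds, and one subtract, none of which crosses a lane boundary, so the same sequence can be applied element-wise to a whole vector. Calling check_lane on a handful of values (0, 0xFFFFFFFF, and a few arbitrary patterns) is a quick way to confirm the parenthesization of the third step.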