[llvm-commits] [PATCH] Population-count loop idiom recognization
Hal Finkel
hfinkel at anl.gov
Mon Nov 19 12:44:58 PST 2012
----- Original Message -----
> From: "Shuxin Yang" <shuxin.llvm at gmail.com>
> To: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Monday, November 19, 2012 12:09:39 PM
> Subject: [llvm-commits] [PATCH] Population-count loop idiom recognization
>
> Hi, dear all:
>
> The attached patch is to recognize this population-count pattern:
>
> --------------------------------------------------------------------
> int popcount(unsigned long long a) {
>     int c = 0;
>     while (a) {
>         c++;
>         ... // both a & c would be used multiple times in or out of loop
>         a &= a - 1;
>         ...
>     }
>
>     return c;
> }
> ---------------------------------------------------------------------
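
For readers following the thread, here is a minimal sketch of why the loop above is a population count: `a &= a - 1` clears the lowest set bit, so the body runs once per set bit. The function names `popcount_loop` and `popcount_builtin` are illustrative and not taken from the patch; `__builtin_popcountll` is the GCC/Clang builtin for `unsigned long long`, so this assumes one of those compilers.

```c
/* Illustrative only: names are hypothetical, not from the patch. */

/* The loop idiom: each iteration clears the lowest set bit of a,
   so the loop body executes exactly once per set bit. */
static int popcount_loop(unsigned long long a) {
    int c = 0;
    while (a) {
        c++;
        a &= a - 1;
    }
    return c;
}

/* What the loop could be replaced with once the idiom is recognized
   (assumes GCC or Clang, which provide this builtin). */
static int popcount_builtin(unsigned long long a) {
    return __builtin_popcountll(a);
}
```

Both functions should return the same value for every input, for example 4 for `0xF0ULL` and 64 for an all-ones 64-bit value.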
>
> The changes are highlighted below:
> =================================
> 1. The Loop-Idiom-Recognize pass identifies the pattern and converts the
>    relevant code into the
>    built-in function __builtin_popcount().
> 2. CodeGen will expand __builtin_popcount() into a single
>    instruction or an instruction sequence.
> 3. This optimization is enabled only if the underlying architecture has
>    "fast" popcount hardware support
>    (ScalarTargetTransformInfo::getPopcntHwSupport() returns "Fast").
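
The gating in item 3 can be pictured roughly as below. This is only a sketch under stated assumptions: the enum `popcnt_hw_support`, its values, and both function names are hypothetical stand-ins, not the interface from the patch, and the real pass makes this decision at compile time rather than at run time as this C model does.

```c
#include <stdbool.h>

/* Hypothetical model of the target query; not the patch's actual type. */
enum popcnt_hw_support { POPCNT_HW_NONE, POPCNT_HW_SLOW, POPCNT_HW_FAST };

/* Rewrite the loop only when the target reports fast hardware popcount. */
static bool should_rewrite_popcount_loop(enum popcnt_hw_support hw) {
    return hw == POPCNT_HW_FAST;
}

/* Run-time model of the two outcomes: the builtin when the rewrite is
   allowed, otherwise the original bit-clearing loop. */
static int popcount_dispatch(unsigned long long a, enum popcnt_hw_support hw) {
    if (should_rewrite_popcount_loop(hw))
        return __builtin_popcountll(a); /* GCC/Clang builtin (assumed) */
    int c = 0;
    while (a) {
        c++;
        a &= a - 1;
    }
    return c;
}
```

Either path yields the same count; only the cost differs, which is why the rewrite is restricted to targets where the expansion is cheap.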
>
> Hw support for Popcount
> =======================
>
> a) X86-64
>    The POPCNT instruction, introduced alongside SSE4.2, serves this purpose.
>    Population count can be calculated *MORE* efficiently
>    by SSE3* instruction sequences. However, it seems
>    that CodeGen doesn't
>    generate an efficient instruction sequence when SSE3* is available.
>
>    (http://www.strchr.com/crc32_popcnt).
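
To give a feel for the SSE3*-style approach, here is a scalar sketch, assuming the sequence in question is the nibble-lookup technique (a 16-entry table of per-nibble bit counts, which a vector byte shuffle can index for 16 bytes at once). This portable version does one lookup at a time and is not the vector code itself; `nibble_bits` and `popcount_nibble_table` are illustrative names.

```c
#include <stdint.h>

/* Number of set bits in each 4-bit value 0..15. */
static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar analogue of the table-lookup popcount: sum the table entry
   for each of the sixteen nibbles in a 64-bit value. */
static int popcount_nibble_table(uint64_t a) {
    int c = 0;
    for (int i = 0; i < 16; i++) {
        c += nibble_bits[a & 0xF];
        a >>= 4;
    }
    return c;
}
```

Unlike the bit-clearing loop, the trip count here is fixed, which is part of what makes the table approach amenable to vectorization.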
>
> b) ARM
>    Evan told me that pop-count can be calculated very efficiently
>    on ARM with some vector instructions.
>    It seems that CodeGen doesn't take advantage of those vector instructions.
>
> The improvement to CodeGen is on the TODO list.
Likewise, PowerPC has popcnt{d,b,w}; I can add CodeGen support for these as well.
-Hal
>
>
> Performance Impact:
> ==================
> On my MacBook Pro with a Core i7, this change yields an improvement of
> about 2% on crafty from SPEC CPU2000 int, run with the
> ref input (26.6s vs 26.0s, measured at least N times).
>
> crafty is compiled with "-msse4" to enable the optimization. As I
> mentioned above, the pop-count loop
> will be replaced with a single POPCNT instruction. It may be more
> desirable to replace it with an
> SSE3* instruction sequence, as that is faster than the POPCNT instruction.
>
> Tons of thanks in advance!
>
> Best Regards
> Shuxin
>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory