[llvm-commits] [PATCH] Population-count loop idiom recognization

Shuxin Yang shuxin.llvm at gmail.com
Mon Nov 26 19:47:51 PST 2012


Hi, Nadav/Sean/Dmitri/all:

     The attached is the revised patch per the review comments. Compared 
to the previous patch,
the differences are:

     - use Doxygen comments for the new classes and their member 
functions; remove duplicated comments;
        remove extraneous space etc.
     - remove "private" from class LoopIdiomRecognize (by adding few 
public member functions
        to access some data member in this class)
     - replace the Stone Age raw power-of-2 implement with nifty 
isPowerOf2_32().
     - and misc others.

Thank you so much for your time and expertise!

Shuxin

On 11/19/12 12:36 PM, Nadav Rotem wrote:
> Hi Shuxin!
>
> Overall I like it, but  I have a few comments:
>
> +
> +  /// CreateCtpop - Create and insert ctpop.
> +  CallInst *CreateCtpop(Value *Val);
> +
>
> Why did you need to add this function to the IRBuilder ? I understand that the LoopIdiom pass uses the builder API for creating the memset calls, but I think that memset is much more commonly used. I think that you should create the call to popcnt using a regular call.
>
> Regarding the name, it looks like in some places we call it 'CtPop' and in other places we call it 'popcnt'.  I think that we need to fix this at some point. It does not need to be a part of this patch.
>
> +ScalarTargetTransformInfo::PopcntHwSupport
> +X86ScalarTargetTransformImpl::getPopcntHwSupport(unsigned TyWidth) const {
> +  assert(!(TyWidth & (TyWidth - 1)) && "Ty width must be power of 2");
>
> You can use the "isPowerOfTwo_64"  call.
>
>
> +    // Return true iff the block contains nothing but goto instruction.
> +    static bool isAlmostEmpty(BasicBlock *);
>
> Please doxygen the comment. (three slashes ///).  Also, please document the other functions in the class.
>
> +    static BranchInst *getBranch(BasicBlock * BB) {
> +      return dyn_cast<BranchInst>(BB->getTerminator());
> +    }
> +
>
> Maybe you could add assertions here.
>
> +  // NclPopcountRecognize - This class is to recoginize idioms of population-
> +  //  count conducted in a noncountable loop. Currently it only recognizes
> +  //  this pattern: "while(x) {cnt++; ...; x &= x - 1; ...}"
> +  //
> +  class NclPopcountRecognize {
>
> Also, doxygen this comment, and document the members.
>
>
>     class LoopIdiomRecognize : public LoopPass {
> +    friend class NclPopcountRecognize;
>       Loop *CurLoop;
> +    mutable const DataLayout *TD;
> +    mutable DominatorTree *DT;
> +    mutable ScalarEvolution *SE;
> +    mutable TargetLibraryInfo *TLI;
> +    mutable const ScalarTargetTransformInfo *STTI;
>
> Why did you have to make these mutable ?  Why is "NclPopcountRecognize" a friend ? Is there another way to solve this problem ?
>
> +
> +  private:
> +    const DataLayout *getDataLayout() const {
> +      return TD ? TD : TD=getAnalysisIfAvailable<DataLayout>();
> +    }
> +
> +    DominatorTree *getDominatorTree() const {
> +      return DT ? DT : (DT=&getAnalysis<DominatorTree>());
> +    }
> +
>
> Why do you do the lazy evaluation and not just initialize everything upon initialization ?
>
> +// return true iff the block contains nothing but goto.
> +bool LIRUtil::isAlmostEmpty(BasicBlock *BB) {
> +  if (BranchInst *Br = getBranch(BB)) {
> +    return Br->isUnconditional() && BB->size() == 1;
> +  }
> +  return false;
> +}
>
> The documentation needs to be in the class. Also, I think that you are checking for a completely empty loop. No ?
> Do you also need to check other things here, such as that this is the header of the loop ?
>
> +/// NclPopcountRecognize - Take a glimpse of the loop to see if we need to go
> +/// ahead recoginizing the idiom.
> +bool NclPopcountRecognize::preliminaryScreen() {
> +
> +  if (LIR.STTI->getPopcntHwSupport(32) != ScalarTargetTransformInfo::Fast)
>       return false;
>
> Why extra space ? Why are you only checking for 32bits ?
>
> +  // It should have a preheader containing nothing but a goto instruction.
> +  BasicBlock *PreHead = CurLoop->getLoopPreheader();
>
> We don't have goto instructions, just branches :)
>
> +  // It should have a precondition block where the generated popcount instrinsic
> +  // function will be inserted
> .
> Predondition -> preheader .
>
>
> Thanks,
> Nadav
>
> On Nov 19, 2012, at 12:09 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> Hi, dear all:
>>
>>   The attached patch is to recognize this population-count pattern:
>>
>> --------------------------------------------------------------------
>> int popcount(unsigned long long a) {
>>     int c = 0;
>>     while (a) {
>>         c++;
>>         ...  // both a & c would be used multiple times in or out of loop
>>         a &= a - 1;
>>         ...
>>     }
>>
>>     return c;
>> }
>> ---------------------------------------------------------------------
>>
>> The changes are highlight bellow:
>> =================================
>>   1.Loop-idiom-Recognize pass identifies the patten and convert the releveant code into
>>     built-in function __builtin_popcount().
>>   2.CodeGen will expand the __builtin_popcount() into single instruction or a instruction sequence.
>>   3.This optimization is enabled only if underlying architecture has "fast" popcount hw support.
>>     (ScalarTargetTransformInfo::getPopcntHwSupport() returns "Fast")
>>
>> Hw support for Popcount
>> =======================
>>
>>   a) X86-64
>>      SSE4.1 provide POPCNT instruction for this purpose; Population-count can be *MORE* efficiently
>>      calcuated by SSE3* instruction sequences. However, it seems to be that CodeGen doesn't
>>      generate efficient instruction sequence when SSE3* is avaiable.
>>
>>     (http://www.strchr.com/crc32_popcnt).
>>
>>   b) ARM
>>      Evan told me that pop-count can be calculated very efficiently on ARM with some vect instruction.
>>      It seems the CodeGen doesn't take advantage of vect instr.
>>
>>   The improvement to CodeGen is on the TODO list.
>>
>>
>> Performance Impact:
>> ==================
>>    On my MacBookPro with corei7, this change see about 2% odd improvement on crafty at CPU2kint fed with
>>    Ref input (26.6s vs 26.0s, measured by at least N times).
>>
>>    The crafty is compiled with "-msse4" to enable the opt. As I mentioned above, the pop-count loop
>> will be replaced into a single POPCNT instruction. Maybe it is more desirable to be replaced with a
>> SSE3* instruction sequences as it is faster than POPCNT instruction.
>>
>> Tons of thanks in advance!
>>
>> Best Regards
>> Shuxin
>> <popcount.patch>_______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

-------------- next part --------------
A non-text attachment was scrubbed...
Name: popcount.v2.patch
Type: text/x-patch
Size: 27889 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20121126/f22c1a64/attachment.bin>


More information about the llvm-commits mailing list