[llvm-commits] [PATCH] Population-count loop idiom recognization
Shuxin Yang
shuxin.llvm at gmail.com
Mon Nov 26 19:47:51 PST 2012
Hi, Nadav/Sean/Dmitri/all:
The attached is the revised patch per the review comments. Compared
to the previous patch,
the differences are:
- use Doxygen comments for the new classes and their member
functions; remove duplicated comments;
remove extraneous space etc.
- remove "private" from class LoopIdiomRecognize (by adding few
public member functions
to access some data member in this class)
- replace the Stone Age raw power-of-2 implement with nifty
isPowerOf2_32().
- and misc others.
Thank you so much for your time and expertise!
Shuxin
On 11/19/12 12:36 PM, Nadav Rotem wrote:
> Hi Shuxin!
>
> Overall I like it, but I have a few comments:
>
> +
> + /// CreateCtpop - Create and insert ctpop.
> + CallInst *CreateCtpop(Value *Val);
> +
>
> Why did you need to add this function to the IRBuilder ? I understand that the LoopIdiom pass uses the builder API for creating the memset calls, but I think that memset is much more commonly used. I think that you should create the call to popcnt using a regular call.
>
> Regarding the name, it looks like in some places we call it 'CtPop' and in other places we call it 'popcnt'. I think that we need to fix this at some point. It does not need to be a part of this patch.
>
> +ScalarTargetTransformInfo::PopcntHwSupport
> +X86ScalarTargetTransformImpl::getPopcntHwSupport(unsigned TyWidth) const {
> + assert(!(TyWidth & (TyWidth - 1)) && "Ty width must be power of 2");
>
> You can use the "isPowerOfTwo_64" call.
>
>
> + // Return true iff the block contains nothing but goto instruction.
> + static bool isAlmostEmpty(BasicBlock *);
>
> Please doxygen the comment. (three slashes ///). Also, please document the other functions in the class.
>
> + static BranchInst *getBranch(BasicBlock * BB) {
> + return dyn_cast<BranchInst>(BB->getTerminator());
> + }
> +
>
> Maybe you could add assertions here.
>
> + // NclPopcountRecognize - This class is to recoginize idioms of population-
> + // count conducted in a noncountable loop. Currently it only recognizes
> + // this pattern: "while(x) {cnt++; ...; x &= x - 1; ...}"
> + //
> + class NclPopcountRecognize {
>
> Also, doxygen this comment, and document the members.
>
>
> class LoopIdiomRecognize : public LoopPass {
> + friend class NclPopcountRecognize;
> Loop *CurLoop;
> + mutable const DataLayout *TD;
> + mutable DominatorTree *DT;
> + mutable ScalarEvolution *SE;
> + mutable TargetLibraryInfo *TLI;
> + mutable const ScalarTargetTransformInfo *STTI;
>
> Why did you have to make these mutable ? Why is "NclPopcountRecognize" a friend ? Is there another way to solve this problem ?
>
> +
> + private:
> + const DataLayout *getDataLayout() const {
> + return TD ? TD : TD=getAnalysisIfAvailable<DataLayout>();
> + }
> +
> + DominatorTree *getDominatorTree() const {
> + return DT ? DT : (DT=&getAnalysis<DominatorTree>());
> + }
> +
>
> Why do you do the lazy evaluation and not just initialize everything upon initialization ?
>
> +// return true iff the block contains nothing but goto.
> +bool LIRUtil::isAlmostEmpty(BasicBlock *BB) {
> + if (BranchInst *Br = getBranch(BB)) {
> + return Br->isUnconditional() && BB->size() == 1;
> + }
> + return false;
> +}
>
> The documentation needs to be in the class. Also, I think that you are checking for a completely empty loop. No ?
> Do you also need to check other things here, such as that this is the header of the loop ?
>
> +/// NclPopcountRecognize - Take a glimpse of the loop to see if we need to go
> +/// ahead recoginizing the idiom.
> +bool NclPopcountRecognize::preliminaryScreen() {
> +
> + if (LIR.STTI->getPopcntHwSupport(32) != ScalarTargetTransformInfo::Fast)
> return false;
>
> Why extra space ? Why are you only checking for 32bits ?
>
> + // It should have a preheader containing nothing but a goto instruction.
> + BasicBlock *PreHead = CurLoop->getLoopPreheader();
>
> We don't have goto instructions, just branches :)
>
> + // It should have a precondition block where the generated popcount instrinsic
> + // function will be inserted
> .
> Predondition -> preheader .
>
>
> Thanks,
> Nadav
>
> On Nov 19, 2012, at 12:09 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> Hi, dear all:
>>
>> The attached patch is to recognize this population-count pattern:
>>
>> --------------------------------------------------------------------
>> int popcount(unsigned long long a) {
>> int c = 0;
>> while (a) {
>> c++;
>> ... // both a & c would be used multiple times in or out of loop
>> a &= a - 1;
>> ...
>> }
>>
>> return c;
>> }
>> ---------------------------------------------------------------------
>>
>> The changes are highlight bellow:
>> =================================
>> 1.Loop-idiom-Recognize pass identifies the patten and convert the releveant code into
>> built-in function __builtin_popcount().
>> 2.CodeGen will expand the __builtin_popcount() into single instruction or a instruction sequence.
>> 3.This optimization is enabled only if underlying architecture has "fast" popcount hw support.
>> (ScalarTargetTransformInfo::getPopcntHwSupport() returns "Fast")
>>
>> Hw support for Popcount
>> =======================
>>
>> a) X86-64
>> SSE4.1 provide POPCNT instruction for this purpose; Population-count can be *MORE* efficiently
>> calcuated by SSE3* instruction sequences. However, it seems to be that CodeGen doesn't
>> generate efficient instruction sequence when SSE3* is avaiable.
>>
>> (http://www.strchr.com/crc32_popcnt).
>>
>> b) ARM
>> Evan told me that pop-count can be calculated very efficiently on ARM with some vect instruction.
>> It seems the CodeGen doesn't take advantage of vect instr.
>>
>> The improvement to CodeGen is on the TODO list.
>>
>>
>> Performance Impact:
>> ==================
>> On my MacBookPro with corei7, this change see about 2% odd improvement on crafty at CPU2kint fed with
>> Ref input (26.6s vs 26.0s, measured by at least N times).
>>
>> The crafty is compiled with "-msse4" to enable the opt. As I mentioned above, the pop-count loop
>> will be replaced into a single POPCNT instruction. Maybe it is more desirable to be replaced with a
>> SSE3* instruction sequences as it is faster than POPCNT instruction.
>>
>> Tons of thanks in advance!
>>
>> Best Regards
>> Shuxin
>> <popcount.patch>_______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
-------------- next part --------------
A non-text attachment was scrubbed...
Name: popcount.v2.patch
Type: text/x-patch
Size: 27889 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20121126/f22c1a64/attachment.bin>
More information about the llvm-commits
mailing list