[llvm-commits] [llvm] r168931 - in /llvm/trunk: include/llvm/Target/TargetTransformImpl.h include/llvm/TargetTransformInfo.h lib/Target/X86/X86ISelLowering.cpp lib/Target/X86/X86ISelLowering.h lib/Target/X86/X86TargetMachine.h lib/Transforms/Scalar/LoopIdiomRecognize.cpp test/Transforms/LoopIdiom/popcnt.ll

Shuxin Yang shuxin.llvm at gmail.com
Wed Jan 9 10:31:01 PST 2013


Chandler, Thank you for your feedback, see the following inline comments

>     For a loop like this, slow HW support may outperform the SW version:
>
>      for (i = 0; i < 32; i++)
>        cnt += (val >> i) & 1;
>
>
> I don't understand your response. My comment was asking why the 
> constraints in this comment aren't reflected in the name of the 
> interface. For example "getPopcntSupport" producing a simple "slow" 
> kind of support doesn't convey under what conditions it is slow, or 
> what the expected use case is. I was wondering about maybe 
> 'isSparsePopcntFast' and/or 'isDensePopcntFast'... You could in theory 
> take an expected density as input, but that seems like overkill.
>
> Anyways, I certainly agree about needing to factor the density into 
> the test, just was looking for an interface that made it more obvious 
> in addition to the comment.
>
> The other thing that might help here is to clarify what this routine 
> is really describing. It could mean at least any of the following:
> 1) Is there fast, direct hardware support for computing popcnt on an 
> integer? (IE, the popcnt instruction on x86)
> 2) Is there a fast, hardware accelerated form of computing popcnt? 
> (IE, lzcnt, tzcnt, potentially even bsr or other techniques on x86)
> 3) Is there a fast lowering of the popcnt builtin on this target (IE, 
> there are very fast SSSE3 and SSE2 implementations, although the ones 
> I know of here target vectors... there are also all kinds of clever 
> software implementations that we could emit that would be faster than 
> the loop...)
>
> I think the intrinsic is currently modeling #1 based on the name, but 
> the comment and x86 TODO make me wonder if you actually meant 
> something else? (IMO, eventually it might make sense to model #3, but 
> we'd need to teach LLVM to do all the clever tricks in lowering of 
> course... hmmm...)
>
If "val" is 0, the loop takes only one iteration; hence the "sparse"
case. In the sparse case the loop may take only a few cycles, while the
underlying HW support, which could be implemented by a couple of
instructions, could take longer to finish. Depending on how fast the HW
support is, replacing the loop with it doesn't necessarily win.
That said, if the HW support is very fast, it is safe to blindly
convert any such pattern into the popcount intrinsic.
Otherwise, we must be careful about it.

This is an example of "dense" population count:

  for (i = 0; i < 32; i++)
      cnt += (val >> i) & 1;

It always takes 32 iterations. In this case, we can convert to the
popcount intrinsic regardless of whether the HW support is slow or not.

>
>>         +  virtual PopcntHwSupport getPopcntHwSupport(unsigned
>>         IntTyWidthInBit) const {
>>
>>
>>     Why accept the integer type width? does it matter other than it
>>     being a legal integer width for the target?
>     The costs are different for different integer widths.
>
>
> Are they different for different integer width, or different expected 
> density of one bits? I'm not familiar with ARM, but for x86, the 
> instruction latencies and throughput tables I have access to only list 
> Bulldozer as having different throughput for different integer widths, 
> and we don't even model that in the implementation of this routine.
The HW support is not necessarily a *single* instruction. Wider
integers likely take more instructions to compute; it is architecture
dependent.

Even on x86-64, there is direct HW support only from SSE4.1 onward.
Before that, we have to use a couple of SSE3 instructions to compute
the popcount. Please also note that the SSE3 instruction sequence beats
the SSE4.1 popcnt instruction in performance.

>>         +    return None;
>>         +  }
>>          };
>>
>>          /// VectorTargetTransformInfo - This interface is used by
>>         the vectorizers
>>
>>         Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>>         URL:
>>         http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=168931&r1=168930&r2=168931&view=diff
>>         ==============================================================================
>>         --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp (original)
>>         +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Thu Nov 29
>>         13:38:54 2012
>>         @@ -17670,6 +17670,17 @@
>>            return -1;
>>          }
>>
>>         +ScalarTargetTransformInfo::PopcntHwSupport
>>         +X86ScalarTargetTransformImpl::getPopcntHwSupport(unsigned
>>         TyWidth) const {
>>         +  assert(isPowerOf2_32(TyWidth) && "Ty width must be power
>>         of 2");
>>
>>
>>     This constraint seems strange. It certainly doesn't seem like it
>>     should be an assert.
>     It catches the case where this function is fed a nonsense
>     parameter, and it at least makes sense for x86
>
>
> I understand, but I disagree with that design. Specifically, at the IR 
> level we have integer types that are not powers of two width. If this 
> is an important constraint, than it should be very clearly documented, 
> but it would seem more reasonable in a layer like the TTI interface to 
> either A) return 'none' or whatever the conservative answer is, or B) 
> accept the IR-level type, run it through legalization to see the 
> legalized type it ends up with, and use that width.
>
> I would suggest B here as it seems a more robust solution and to 
> follow the spirit of TTI -- modeling the result of CodeGen when 
> lowering the code.
I just want to keep the interface very simple. If the width is not a
power of two, we can simply ignore it for the sake of simplicity.
>
>>
>>     If i have an i24, i cant count the population of it by counting
>>     the population of it zext'ed to 32 bits...
>     You have to pass 32 when calling this function.
>
>
> I meant to say 'i can count', see above. I think the best solution is 
> to pass in the IR-level type directly...
>
>>         +  const X86Subtarget &ST =
>>         TLI->getTargetMachine().getSubtarget<X86Subtarget>();
>>         +
>>         +  // TODO: Currently the __builtin_popcount() implementation
>>         using SSE3
>>         +  //   instructions is inefficient. Once the problem is
>>         fixed, we should
>>         +  //   call ST.hasSSE3() instead of ST.hasSSE4().
>>
>>
>>     __builtin_popcount isn't the issue here, this is in LLVM. What is
>>     the actual issue? Why isn't "Slow' used here?
>     Please check my original mail. It is a real LLVM defect in
>     generating an optimal instruction sequence using SSE3.
>
>
> When I said '__builtin_popcount' I was referring to the Clang builtin 
> function. The LLVM defect has to do with llvm.ctpop implementation. I 
> just didn't want a future reader to be confused between Clang builtins 
> and LLVM intrinsics.
>
>>
>>           namespace {
>>         +
>>         +  class LoopIdiomRecognize;
>>         +
>>         +  /// This class defines some utility functions for loop
>>         idiom recognization.
>>
>>
>>     Please don't use a class to organize a collection of static
>>     methods. That is the role of a namespace.
>     Some reviewers hate this idea. It is not hard to imagine that this
>     class will become more and more complicated.
>     At that point, it will no longer be just a container for static
>     functions.
>
>
> But until that day arrives, I think you should use a more appropriate 
> software design pattern. I follow the YAGNI design philosophy here.
I don't think it is about design patterns. It is just about how to
organize the related functions.
I can pull these static functions out of the class if they seem to be a
real eyesore.

>
>
>>
>>     And this doesn't need a namespace at all -- this is in a single
>>     .cc file. Please just use static functions, that is is the
>>     overwhelmingly predominant pattern in LLVM.
>>
>>         +  class LIRUtil {
>>         +  public:
>>         +    /// Return true iff the block contains nothing but an
>>         uncondition branch
>>         +    /// (aka goto instruction).
>>         +    static bool isAlmostEmpty(BasicBlock *);
>>
>>
>>     Please use names for parameters in LLVM
>>
>>     Also, I don't think this needs a helper method. First, you
>>     *always* have a terminator instruction, the verify assures this.
>>     Typically, you would then test that &BB->begin() ==
>>     BB->getTerminatorInst().
>
>>
>>     Or is the point of this function to test for an empty BB *and* an
>>     unconditional branch? If so, why do you need to handle that --
>>     SimplifyCFG should have removed such blocks already?
>     I just want to make sure this code works for any input.
>     I don't want to assume the input CFG is optimized.
>
>
> That isn't how LLVM is architected. A very fundamental design 
> principle of the IR optimizers is to rely on other optimization passes 
> to simplify and canonicalize the IR so that each individual 
> transformation can be simpler and not handle all cases. While 
> sometimes LLVM does duplicate simplification logic, it is usually to 
> handle very specific phase ordering problems, which don't seem to be 
> the case here.
In general I agree with you. But it does not hurt to be safe at a very
low cost.

>>         +
>>         +    static BranchInst *getBranch(BasicBlock *BB) {
>>         +      return dyn_cast<BranchInst>(BB->getTerminator());
>>         +    }
>>
>>
>>     Abstracting this into a method doesn't seem to help readability
>>     at all. I would rather just see the dyn_cast.
>>
>>         +
>>         +    /// Return the condition of the branch terminating the
>>         given basic block.
>>         +    static Value *getBrCondtion(BasicBlock *);
>>
>>
>>     What if it doesn't terminate in a branch? I'm not really sure
>>     what abstraction you're trying to provide here.
>     Then it returns 0; the tiny body of the function says that.
>
>
> While the code is obvious, the code isn't here. This is a place where 
> having just a static helper function w/o the forward declaration and 
> interface comment would simplify things.
>
> That said, I still don't really see why this is worth hoisting into a 
> separate function rather than writing out the code.
It was inlined in the code. However, it is called in several places in
the same function, so I factored it out just to reduce the clutter.

Pattern-matching code usually looks nasty; I have to make it concise
enough to be easier on the reader.

>>
>>         +
>>         +    /// Derive the precondition block (i.e the block that
>>         guards the loop
>>         +    /// preheader) from the given preheader.
>>         +    static BasicBlock *getPrecondBb(BasicBlock *PreHead);
>>
>>
>>     I don't understand this.. What is the invariant it exposes? What
>>     does it test first? Why is this not a method on LoopInfo if this
>>     is generally useful?
>     I'm afraid I don't understand what you are saying?
>     Did I say "invariant"?
>
>
> No, I'm to understand what exactly you mean by a precondition block, 
> because looking at the implementation of this routine, it does 
> something something extremely specific: it finds a single predecessor 
> of the preheader which terminates in a conditional branch. And It's 
> still not clear why this is not one of the methods on LoopInfo.
By "precondition block", I mean the block containing the precondition
for entering the loop.

e.g. "for (i = 0; i < n; i++) { ... };" has a precondition block testing
"i < n" (after the loop is rotated);
    however, "do { } while (cond);" does not have such a precondition block.

    If a precondition block is present, I try to place the intrinsic in
the precondition block, in the hope of completely getting rid of the
precondition block.

    If I just place the intrinsic in the preheader, I end up with the
following code:
    if (precondition) {
         popcount intrinsic
    }

    If you still don't understand, try writing a snippet with the memcpy
pattern. You will find the precondition still remains
after the pattern is recognized. This is one defect I'm going to fix.


>>         +  };
>>         +
>>         +  /// This class is to recoginize idioms of population-count
>>         conducted in
>>         +  /// a noncountable loop. Currently it only recognizes this
>>         pattern:
>>         +  /// \code
>>         +  ///   while(x) {cnt++; ...; x &= x - 1; ...}
>>         +  /// \endcode
>>         +  class NclPopcountRecognize {
>>
>>
>>     If NCL is an acronym you want to use, capitalize it as an
>>     acronym. I don't think it is a terribly useful acronym. I would
>>     call this just PopcountLoopRecognizer (note a noun, as this is a
>>     class), and document what sets of loops it handles currently. If
>>     there is some reason we wouldn't re-use this infrastructure for
>>     any future extensions, you should explain that here.
>     Population count has tons of variants.
>     The loops could be countable or non-countable. I don't think it is
>     a good idea to recognize all variants in one class, since the ways
>     we recognize a pattern in a non-countable loop and in a countable
>     loop are quite different.
>     We need some way to differentiate them in the name.
>
>
> But we don't have any of these other variants yet. Until we do, I'm 
> advocating for a simpler design as there is nothing do differentiate.
Nope, we still have a couple of variants. One stupid but very popular
one being:

  for (i = 0; i < n; i++)
     cnt += (val >> i) & 1;


>     If "NCL" doesn't seem to be a good one, what is your suggestion?
>
>
>
>>
>>
>>     Also, have you thought much about the design of these classes?
>>     Currently, we have LoopIdiomRecognize which is a pass, and which
>>     has innate logic for forming memset intrinsics out of loops. But
>>     we now also have a non-pass helper class to recognize a specific
>>     kind of loop. How would you envision this design scaling up to
>>     the next loop idiom?
>>
>     Yes, I did.
>     It was a big class trying to recognize all the interesting
>     patterns. I don't want it to become messy.
>     That is why I implemented popcount recognition in a separate class.
>     In the future, I'd like new non-trivial recognizers to be
>     implemented in separate classes as well.
>     That said, I don't want to pull what we already have in the pass
>     into separate classes.
>     We have to make changes one step at a time.
>
>
> Yea, I agree we need some way to partition stuff up. I like pulling 
> things out, but I think it would be useful to get some pattern in 
> place across the loop idioms. I'm not yet sure what the best pattern 
> is though, which is why I was asking your thoughts on the subject.
