[llvm-commits] [llvm] r168931 - in /llvm/trunk: include/llvm/Target/TargetTransformImpl.h include/llvm/TargetTransformInfo.h lib/Target/X86/X86ISelLowering.cpp lib/Target/X86/X86ISelLowering.h lib/Target/X86/X86TargetMachine.h lib/Transforms/Scalar/LoopIdiomRecognize.cpp test/Transforms/LoopIdiom/popcnt.ll

Chandler Carruth chandlerc at google.com
Wed Jan 9 01:47:07 PST 2013


Trying to respond to your comments inline.

On Fri, Dec 7, 2012 at 5:58 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:

>
>
>> +  /// performance if the population is sparse. A HW support is
>> considered as
>> +  /// "Fast" if it can outperform, or is on a par with, SW implementaion
>> when
>> +  /// the population is sparse; otherwise, it is considered as "Slow".
>>
>
>  I wonder, why not express this in the interface? You could query for
> fast/slow with an expected density?
>
> Why not?
> Because for a loop like this, slow HW support may still outperform the SW version:
>
>  for (i = 0; i < 32; i++)
>    cnt += (val >> i) & 1;
>

I don't understand your response. My comment was asking why the constraints
in this comment aren't reflected in the name of the interface. For example
"getPopcntSupport" producing a simple "slow" kind of support doesn't convey
under what conditions it is slow, or what the expected use case is. I was
wondering about maybe 'isSparsePopcntFast' and/or 'isDensePopcntFast'...
You could in theory take an expected density as input, but that seems like
overkill.

Anyway, I certainly agree about needing to factor the density into the
test; I was just looking for an interface that makes it more obvious in
addition to the comment.
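Purely as a sketch of the shape I had in mind (hypothetical names; nothing
like this exists in the tree, and the defaults are just placeholders):

  // Hypothetical sketch -- these hooks would live next to getPopcntHwSupport.
  class ScalarTargetTransformInfoSketch {
  public:
    virtual ~ScalarTargetTransformInfoSketch() {}
    /// Is a HW popcnt at least as fast as the SW loop when set bits are sparse?
    virtual bool isSparsePopcntFast(unsigned IntTyWidthInBit) const {
      return false;  // conservative default
    }
    /// Is a HW popcnt at least as fast as the SW loop when set bits are dense?
    virtual bool isDensePopcntFast(unsigned IntTyWidthInBit) const {
      return false;  // conservative default
    }
  };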

The other thing that might help here is to clarify what this routine is
really describing. It could mean at least the following:
1) Is there fast, direct hardware support for computing popcnt on an
integer? (i.e., the popcnt instruction on x86)
2) Is there a fast, hardware-accelerated way of computing popcnt? (i.e.,
lzcnt, tzcnt, potentially even bsr, or other techniques on x86)
3) Is there a fast lowering of the popcnt intrinsic on this target? (i.e.,
there are very fast SSSE3 and SSE2 implementations, although the ones I
know of target vectors... there are also all kinds of clever software
implementations we could emit that would be faster than the loop.)

I think the hook is currently modeling #1 based on the name, but the
comment and the x86 TODO make me wonder whether you actually meant
something else. (IMO, it might eventually make sense to model #3, but we'd
need to teach LLVM all the clever lowering tricks first, of course...)
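For concreteness, the kind of "clever software implementation" I have in
mind for #3 is the standard SWAR bit-twiddling sequence, sketched here for
32 bits (this is the well-known trick, not code from the patch):

  #include <cstdint>

  // Classic SWAR popcount of a 32-bit value: constant time, no loop, no table.
  static unsigned swarPopcount32(uint32_t V) {
    V = V - ((V >> 1) & 0x55555555u);                 // 2-bit partial sums
    V = (V & 0x33333333u) + ((V >> 2) & 0x33333333u); // 4-bit partial sums
    V = (V + (V >> 4)) & 0x0F0F0F0Fu;                 // 8-bit partial sums
    return (V * 0x01010101u) >> 24;                   // add the four bytes
  }

Emitting something like that from the recognizer should generally beat the
original loop even on targets with no popcnt instruction at all.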


>  +  virtual PopcntHwSupport getPopcntHwSupport(unsigned IntTyWidthInBit)
>> const {
>>
>
>  Why accept the integer type width? does it matter other than it being a
> legal integer width for the target?
>
> The costs are different for different integer widths.
>

Are they different for different integer widths, or for different expected
densities of one bits? I'm not familiar with ARM, but for x86 the
instruction latency and throughput tables I have access to only list
Bulldozer as having different throughput for different integer widths, and
we don't even model that in the implementation of this routine.


>
>
>> +    return None;
>> +  }
>>  };
>>
>>  /// VectorTargetTransformInfo - This interface is used by the vectorizers
>>
>> Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>> URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=168931&r1=168930&r2=168931&view=diff
>>
>> ==============================================================================
>> --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp (original)
>> +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Thu Nov 29 13:38:54 2012
>> @@ -17670,6 +17670,17 @@
>>    return -1;
>>  }
>>
>> +ScalarTargetTransformInfo::PopcntHwSupport
>> +X86ScalarTargetTransformImpl::getPopcntHwSupport(unsigned TyWidth) const
>> {
>> +  assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
>
>
>  This constraint seems strange. It certainly doesn't seem like it should
> be an assert.
>
> It catches the case where this function is fed a nonsense parameter, and
> it at least makes sense for x86.
>

I understand, but I disagree with that design. Specifically, at the IR
level we have integer types whose widths are not powers of two. If this is
an important constraint, then it should be very clearly documented, but it
would seem more reasonable in a layer like the TTI interface to either A)
return 'None' or whatever the conservative answer is, or B) accept the
IR-level type, run it through legalization to see what legalized type it
ends up as, and use that width.

I would suggest B here, as it seems the more robust solution and follows
the spirit of TTI -- modeling the result of CodeGen when lowering the code.
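Roughly, option B would look something like this (just a sketch on my
part; the helper names are from memory and the subtarget check is only
illustrative):

  // Sketch only: take the IR type, let type legalization decide what it
  // becomes (e.g. i24 is promoted to i32), and answer for the legalized width.
  ScalarTargetTransformInfo::PopcntHwSupport
  X86ScalarTargetTransformImpl::getPopcntHwSupport(Type *Ty) const {
    EVT VT = TLI->getValueType(Ty);
    while (VT.isInteger() && !TLI->isTypeLegal(VT))
      VT = TLI->getTypeToTransformTo(Ty->getContext(), VT);
    unsigned TyWidth = VT.getSizeInBits();

    const X86Subtarget &ST =
      TLI->getTargetMachine().getSubtarget<X86Subtarget>();
    // The predicate here is only illustrative; keep whatever check the real
    // implementation settles on.
    return ST.hasPOPCNT() && (TyWidth == 32 || TyWidth == 64) ? Fast : None;
  }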


>
>  If i have an i24, i cant count the population of it by counting the
> population of it zext'ed to 32 bits...
>
> You have to pass 32 when calling this function.
>

I meant to say 'i can count'; see above. I think the best solution is to
pass in the IR-level type directly...


>
>>
>
>  +  const X86Subtarget &ST =
>> TLI->getTargetMachine().getSubtarget<X86Subtarget>();
>> +
>> +  // TODO: Currently the __builtin_popcount() implementation using SSE3
>> +  //   instructions is inefficient. Once the problem is fixed, we should
>> +  //   call ST.hasSSE3() instead of ST.hasSSE4().
>>
>
>  __builtin_popcount isn't the issue here; this is in LLVM. What is the
> actual issue? Why isn't "Slow" used here?
>
> Please check my original mail.  It is a real LLVM defect in generating an
> optimal instruction sequence using SSE3.
>

When I said '__builtin_popcount' I was referring to the Clang builtin
function; the LLVM defect has to do with the lowering of the llvm.ctpop
intrinsic. I just didn't want a future reader to confuse Clang builtins
with LLVM intrinsics.
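That is, the distinction is between the source-level builtin and the IR
intrinsic it lowers to:

  // Clang's builtin, at the source level...
  unsigned f(unsigned X) { return __builtin_popcount(X); }
  // ...is lowered by Clang to a call to the LLVM intrinsic llvm.ctpop.i32;
  // it is the codegen of that intrinsic that has the inefficiency the TODO
  // above is talking about.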


>
>
>>   namespace {
>> +
>> +  class LoopIdiomRecognize;
>> +
>> +  /// This class defines some utility functions for loop idiom
>> recognization.
>>
>
>  Please don't use a class to organize a collection of static methods.
> That is the role of a namespace.
>
> Some reviewers hate this idea.  It is not hard to imagine that this class
> will become more and more complicated.
> At that point, it will not be a container for static functions any more.
>

But until that day arrives, I think you should use a more appropriate
software design pattern. I follow the YAGNI design philosophy here.


>
>  And this doesn't need a namespace at all -- this is in a single .cc
> file. Please just use static functions; that is the overwhelmingly
> predominant pattern in LLVM.
>
>
>> +  class LIRUtil {
>> +  public:
>> +    /// Return true iff the block contains nothing but an uncondition
>> branch
>> +    /// (aka goto instruction).
>> +    static bool isAlmostEmpty(BasicBlock *);
>>
>
>  Please use names for parameters in LLVM
>
>
>  Also, I don't think this needs a helper method. First, you *always* have
> a terminator instruction; the verifier assures this. Typically, you would
> then test that &BB->front() == BB->getTerminator().
>
>
>
>  Or is the point of this function to test for an empty BB *and* an
> unconditional branch? If so, why do you need to handle that -- SimplifyCFG
> should have removed such blocks already?
>
> I just want to make sure this code works for any input.
> I don't want to assume the input CFG has already been optimized.
>

That isn't how LLVM is architected. A very fundamental design principle of
the IR optimizers is to rely on other optimization passes to simplify and
canonicalize the IR so that each individual transformation can be simpler
and not have to handle all cases. While LLVM does sometimes duplicate
simplification logic, that is usually to handle very specific phase-ordering
problems, which doesn't seem to be the case here.
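If a helper survives at all, with canonicalized input I'd expect it to
reduce to something this small (a sketch, not the committed code):

  // With SimplifyCFG having run, "almost empty" is just: the terminator is
  // the only instruction in the block and it is an unconditional branch.
  static bool isAlmostEmpty(BasicBlock *BB) {
    BranchInst *BI = dyn_cast<BranchInst>(BB->getTerminator());
    return BI && BI->isUnconditional() && &BB->front() == BI;
  }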


>
>
>> +
>> +    static BranchInst *getBranch(BasicBlock *BB) {
>> +      return dyn_cast<BranchInst>(BB->getTerminator());
>> +    }
>>
>
>  Abstracting this into a method doesn't seem to help readability at all.
> I would rather just see the dyn_cast.
>
>
>> +
>> +    /// Return the condition of the branch terminating the given basic
>> block.
>> +    static Value *getBrCondtion(BasicBlock *);
>>
>
>  What if it doesn't terminate in a branch? I'm not really sure what
> abstraction you're trying to provide here.
>
> Then it returns 0; the tiny bit of code says that.
>

The code may be obvious, but it isn't here: this is a place where having
just a static helper function, without the forward declaration and
interface comment, would simplify things.

That said, I still don't really see why this is worth hoisting into a
separate function rather than writing the code out inline.
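To be concrete, at the use site I'd rather just read something like this
(sketch, not the committed code):

  // Grab the condition of the terminating branch, or null if the block
  // doesn't end in a conditional branch.
  Value *Cond = 0;
  if (BranchInst *BI = dyn_cast<BranchInst>(BB->getTerminator()))
    if (BI->isConditional())
      Cond = BI->getCondition();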


>
>  +
>> +    /// Derive the precondition block (i.e the block that guards the loop
>> +    /// preheader) from the given preheader.
>> +    static BasicBlock *getPrecondBb(BasicBlock *PreHead);
>>
>
>  I don't understand this.. What is the invariant it exposes? What does it
> test first? Why is this not a method on LoopInfo if this is generally
> useful?
>
> I'm afraid I don't understand what you are saying?
> Did I say "invariant"?
>

No, I'm trying to understand what exactly you mean by a precondition block,
because looking at the implementation of this routine, it does something
extremely specific: it finds the single predecessor of the preheader,
provided that block terminates in a conditional branch. And it's still not
clear why this is not one of the methods on LoopInfo.
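As far as I can tell from reading it, the helper amounts to roughly this
(my paraphrase, not the committed code):

  // Paraphrase of what getPrecondBb appears to compute: the unique
  // predecessor of the preheader, but only if it ends in a conditional branch.
  static BasicBlock *getPrecondBlock(BasicBlock *PreHead) {
    BasicBlock *Pred = PreHead->getSinglePredecessor();
    if (!Pred)
      return 0;
    BranchInst *BI = dyn_cast<BranchInst>(Pred->getTerminator());
    return (BI && BI->isConditional()) ? Pred : 0;
  }

If that is the intended contract, spelling it out in the comment (or
putting it somewhere more general) would help.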


>
>
>> +  };
>> +
>> +  /// This class is to recoginize idioms of population-count conducted in
>> +  /// a noncountable loop. Currently it only recognizes this pattern:
>> +  /// \code
>> +  ///   while(x) {cnt++; ...; x &= x - 1; ...}
>> +  /// \endcode
>> +  class NclPopcountRecognize {
>>
>
>  If NCL is an acronym you want to use, capitalize it as an acronym. I
> don't think it is a terribly useful acronym. I would call this just
> PopcountLoopRecognizer (note a noun, as this is a class), and document what
> sets of loops it handles currently. If there is some reason we wouldn't
> re-use this infrastructure for any future extensions, you should explain
> that here.
>
> Population count has tons of variants.
> The loops could be countable or non-countable, and I don't think it is a
> good idea to recognize all the variants in one class, since the way we
> recognize a pattern in a non-countable loop and in a countable loop is
> quite different.
> That is why we have something in the name to differentiate them.
>

But we don't have any of these other variants yet. Until we do, I'm
advocating for the simpler design, as there is nothing to differentiate.


> If "NCL" doesn't seem to be a good one, what is you suggession?
>

>
>
>
>  Also, have you thought much about the design of these classes?
> Currently, we have LoopIdiomRecognize which is a pass, and which has innate
> logic for forming memset intrinsics out of loops. But we now also have a
> non-pass helper class to recognize a specific kind of loop. How would you
> envision this design scaling up to the next loop idiom?
>
>    Yes, I did.
> It was a big class trying to recognize all the interesting patterns, and
> I don't want it to become messy.
> That is why I implemented the popcount recognition in a separate class.
> In the future, I'd like new non-trivial recognizers to be implemented in
> separate classes as well.
> That said, I don't like pulling what we already have in the pass out into
> separate classes.
> We have to make changes one step at a time.
>

Yeah, I agree we need some way to partition this stuff up. I like pulling
things out, but I think it would be useful to establish a common pattern
across the loop idioms. I'm not yet sure what the best pattern is, though,
which is why I was asking for your thoughts on the subject.