<div dir="ltr"><div class="gmail_default" style>Trying to respond to your comments inline.</div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 7, 2012 at 5:58 PM, Shuxin Yang <span dir="ltr"><<a href="mailto:shuxin.llvm@gmail.com" target="_blank" class="cremed">shuxin.llvm@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000"><div class="im"><blockquote type="cite"><div style="font-family:arial,helvetica,sans-serif;font-size:10pt"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +  /// performance if the population is sparse. A HW

                support is considered as<br>

                +  /// "Fast" if it can outperform, or is on a par with,

                SW implementaion when<br>

                +  /// the population is sparse; otherwise, it is

                considered as "Slow".<br>

              </blockquote>

              <div><br>

              </div>

              <div>I wonder, why not express this in the

                interface? You could query for fast/slow with an

                expected density?</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    Why not?<br>

    for the loop like this, slow HW support may outperform the SW

    version:<br>

    <br>

     for (i = 0; I < 32; I++)<br>

       cnt += (val >> 1) & 1;</div></blockquote><div><br></div><div style>I don't understand your response. My comment was asking why the constraints in this comment aren't reflected in the name of the interface. For example "getPopcntSupport" producing a simple "slow" kind of support doesn't convey under what conditions it is slow, or what the expected use case is. I was wondering about maybe 'isSparsePopcntFast' and/or 'isDensePopcntFast'... You could in theory take an expected density as input, but that seems like overkill.</div>

<div style><br></div><div style>Anyways, I certainly agree about needing to factor the density into the test, just was looking for an interface that made it more obvious in addition to the comment.</div><div style><br></div>

<div style>The other thing that might help here is to clarify what this routine is really describing. It could mean at least any of the following:</div><div style>1) Is there fast, direct hardware support for computing popcnt on an integer? (IE, the popcnt instruction on x86)</div>

<div style>2) Is there a fast, hardware accelerated form of computing popcnt? (IE, lzcnt, tzcnt, potentially even bsr or other techniques on x86)</div><div style>3) Is there a fast lowering of the popcnt builtin on this target (IE, there are very fast SSSE3 and SSE2 implementations, although the ones I know of here target vectors... there are also all kinds of clever software implementations that we could emit that would be faster than the loop...)</div>

<div style><br></div><div style>I think the intrinsic is currently modeling #1 based on the name, but the comment and x86 TODO make me wonder if you actually meant something else? (IMO, eventually it might make sense to model #3, but we'd need to teach LLVM to do all the clever tricks in lowering of course... hmmm...)</div>

<div style><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div class="im"><br><blockquote type="cite"><div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +  virtual PopcntHwSupport getPopcntHwSupport(unsigned

                IntTyWidthInBit) const {<br>

              </blockquote>

              <div><br>

              </div>

              <div>Why accept the integer type width? does it

                matter other than it being a legal integer width for the

                target?</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    The cost are different for int-width.<br></div></blockquote><div><br></div><div style>Are they different for different integer width, or different expected density of one bits? I'm not familiar with ARM, but for x86, the instruction latencies and throughput tables I have access to only list Bulldozer as having different throughput for different integer widths, and we don't even model that in the implementation of this routine.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div class="im"><blockquote type="cite"><div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +    return None;<br>

                +  }<br>

                 };<br>

                <br>

                 /// VectorTargetTransformInfo - This interface is used

                by the vectorizers<br>

                <br>

                Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp<br>

                URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=168931&r1=168930&r2=168931&view=diff" target="_blank" class="cremed">http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=168931&r1=168930&r2=168931&view=diff</a><br>

==============================================================================<br>

                --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

                (original)<br>

                +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Thu

                Nov 29 13:38:54 2012<br>

                @@ -17670,6 +17670,17 @@<br>

                   return -1;<br>

                 }<br>

                <br>

                +ScalarTargetTransformInfo::PopcntHwSupport<br>

                +X86ScalarTargetTransformImpl::getPopcntHwSupport(unsigned

                TyWidth) const {<br>

                +  assert(isPowerOf2_32(TyWidth) && "Ty width

                must be power of 2");</blockquote>

              <div><br>

              </div>

              <div>This constraint seems strange. It certainly

                doesn't seem like it should be an assert.</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    It catch the case when this function is fed with nonsense parameter,

    and it at least make sense for x86</div></blockquote><div><br></div><div style>I understand, but I disagree with that design. Specifically, at the IR level we have integer types that are not powers of two width. If this is an important constraint, than it should be very clearly documented, but it would seem more reasonable in a layer like the TTI interface to either A) return 'none' or whatever the conservative answer is, or B) accept the IR-level type, run it through legalization to see the legalized type it ends up with, and use that width.</div>

<div><br></div><div style>I would suggest B here as it seems a more robust solution and to follow the spirit of TTI -- modeling the result of CodeGen when lowering the code.</div><div style> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div bgcolor="#FFFFFF" text="#000000"><div class="im"><blockquote type="cite"><div style="font-family:arial,helvetica,sans-serif;font-size:10pt"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br>

              </div>

              <div>If i have an i24, i cant count the

                population of it by counting the population of it

                zext'ed to 32 bits...</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    You have to pass 32 when calling this function.<br></div></blockquote><div><br></div><div style>I mean't to say 'i can count', see above. I think the best solution is to pass in the IR-level type directly...</div>

<div style><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div class="im">

    <blockquote type="cite">

      <div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

        <div dir="ltr">

          <div class="gmail_extra">

            <div class="gmail_quote">

              <div> </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                 </blockquote>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +  const X86Subtarget &ST =

                TLI->getTargetMachine().getSubtarget<X86Subtarget>();<br>

                +<br>

                +  // TODO: Currently the __builtin_popcount()

                implementation using SSE3<br>

                +  //   instructions is inefficient. Once the problem is

                fixed, we should<br>

                +  //   call ST.hasSSE3() instead of ST.hasSSE4().<br>

              </blockquote>

              <div><br>

              </div>

              <div>__builtin_popcount isn't the issue here,

                this is in LLVM. What is the actual issue? Why isn't

                "Slow' used here?</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    Please check my original mail.  It is a real LLVM defect in generate

    optimal instruction sequence using SSE3.<br></div></blockquote><div><br></div><div style>When I said '__builtin_popcount' I was referring to the Clang builtin function. The LLVM defect has to do with llvm.ctpop implementation. I just didn't want a future reader to be confused between Clang builtins and LLVM intrinsics.</div>

<div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div class="im">

    <blockquote type="cite">

      <div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

        <div dir="ltr">

          <div class="gmail_extra">

            <div class="gmail_quote">

              <div> <br>

              </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> 

                namespace {<br>

                +<br>

                +  class LoopIdiomRecognize;<br>

                +<br>

                +  /// This class defines some utility functions for

                loop idiom recognization.<br>

              </blockquote>

              <div><br>

              </div>

              <div>Please don't use a class to organize a

                collection of static methods. That is the role of a

                namespace.</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    Some reviewers hate this idea.  It is not hard to imagine, this

    class will become more and more complicate.<br>

    At that point, it will not be a container for static functions any

    more.<br></div></blockquote><div><br></div><div style>But until that day arrives, I think you should use a more appropriate software design pattern. I follow the YAGNI design philosophy here.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div bgcolor="#FFFFFF" text="#000000"><div class="im">

    <br>

    <blockquote type="cite">

      <div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

        <div dir="ltr">

          <div class="gmail_extra">

            <div class="gmail_quote">

              <div><br>

              </div>

              <div>And this doesn't need a namespace at all --

                this is in a single .cc file. Please just use static

                functions, that is is the overwhelmingly predominant

                pattern in LLVM.</div>

              <div> </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +  class LIRUtil {<br>

                +  public:<br>

                +    /// Return true iff the block contains nothing but

                an uncondition branch<br>

                +    /// (aka goto instruction).<br>

                +    static bool isAlmostEmpty(BasicBlock *);<br>

              </blockquote>

              <div><br>

              </div>

              <div>Please use names for parameters in LLVM</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    </div><blockquote type="cite">

      <div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

        <div dir="ltr">

          <div class="gmail_extra">

            <div class="gmail_quote">

              <div><br>

              </div><div class="im">

              <div>Also, I don't think this needs a helper

                method. First, you *always* have a terminator

                instruction, the verify assures this. Typically, you

                would then test that &BB->begin() ==

                BB->getTerminatorInst().</div>

            </div></div>

          </div>

        </div>

      </div>

    </blockquote><div class="im">

    <br>

    <blockquote type="cite">

      <div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

        <div dir="ltr">

          <div class="gmail_extra">

            <div class="gmail_quote">

              <div><br>

              </div>

              <div>Or is the point of this function to test for

                an empty BB *and* an unconditional branch? If so, why do

                you need to handle that -- SimplifyCFG should have

                removed such blocks already?</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    I just to make sure this code work for any input. <br>

    I don't want to assume the input is CFG optimized.<br></div></blockquote><div><br></div><div style>That isn't how LLVM is architected. A very fundamental design principle of the IR optimizers is to rely on other optimization passes to simplify and canonicalize the IR so that each individual transformation can be simpler and not handle all cases. While sometimes LLVM does duplicate simplification logic, it is usually to handle very specific phase ordering problems, which don't seem to be the case here.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div class="im"><blockquote type="cite"><div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +<br>

                +    static BranchInst *getBranch(BasicBlock *BB) {<br>

                +      return

                dyn_cast<BranchInst>(BB->getTerminator());<br>

                +    }<br>

              </blockquote>

              <div><br>

              </div>

              <div>Abstracting this into a method doesn't seem

                to help readability at all. I would rather just see the

                dyn_cast.</div>

              <div> </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +<br>

                +    /// Return the condition of the branch terminating

                the given basic block.<br>

                +    static Value *getBrCondtion(BasicBlock *);<br>

              </blockquote>

              <div><br>

              </div>

              <div>What if it doesn't terminate in a branch?

                I'm not really sure what abstraction you're trying to

                provide here.</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    Then return 0, the tiny code says that.<br></div></blockquote><div><br></div><div style>While the code is obvious, the code isn't here. This is a place where having just a static helper function w/o the forward declaration and interface comment would simplify things.</div>

<div style><br></div><div style>That said, I still don't really see why this is worth hoisting into a separate function rather than writing out the code.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div bgcolor="#FFFFFF" text="#000000"><div class="im"><blockquote type="cite"><div style="font-family:arial,helvetica,sans-serif;font-size:10pt"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br>

              </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +<br>

                +    /// Derive the precondition block (i.e the block

                that guards the loop<br>

                +    /// preheader) from the given preheader.<br>

                +    static BasicBlock *getPrecondBb(BasicBlock

                *PreHead);<br>

              </blockquote>

              <div><br>

              </div>

              <div>I don't understand this.. What is the

                invariant it exposes? What does it test first? Why is

                this not a method on LoopInfo if this is generally

                useful?</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    I'm afraid I don't understand what you are saying? <br>

    Did I say "invariant"?<br></div></blockquote><div><br></div><div style>No, I'm to understand what exactly you mean by a precondition block, because looking at the implementation of this routine, it does something something extremely specific: it finds a single predecessor of the preheader which terminates in a conditional branch. And It's still not clear why this is not one of the methods on LoopInfo.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div class="im"><blockquote type="cite"><div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> </div>

              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

                +  };<br>

                +<br>

                +  /// This class is to recoginize idioms of

                population-count conducted in<br>

                +  /// a noncountable loop. Currently it only recognizes

                this pattern:<br>

                +  /// \code<br>

                +  ///   while(x) {cnt++; ...; x &= x - 1; ...}<br>

                +  /// \endcode<br>

                +  class NclPopcountRecognize {<br>

              </blockquote>

              <div><br>

              </div>

              <div>If NCL is an acronym you want to use,

                capitalize it as an acronym. I don't think it is a

                terribly useful acronym. I would call this just

                PopcountLoopRecognizer (note a noun, as this is a

                class), and document what sets of loops it handles

                currently. If there is some reason we wouldn't re-use

                this infrastructure for any future extensions, you

                should explain that here.</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    Population count has tons of variants. <br>

    The loops could be countable loop or non-countable loop.  I don't

    think it is good idea to recognize all variants in one class. <br>

    Since the way we recognize a pattern in non-countable loop and

    countable loop are quite different. <br>

    We have some way to differentiate them in the name.<br></div></blockquote><div><br></div><div style>But we don't have any of these other variants yet. Until we do, I'm advocating for a simpler design as there is nothing do differentiate.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">

    If "NCL" doesn't seem to be a good one, what is you suggession? </div></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">

<div class="im"><br>

    <br>

    <blockquote type="cite">

      <div style="font-family:arial,helvetica,sans-serif;font-size:10pt">

        <div dir="ltr">

          <div class="gmail_extra">

            <div class="gmail_quote">

              <div><br>

              </div>

              <div><br>

              </div>

              <div>Also, have you thought much about the design

                of these classes? Currently, we have LoopIdiomRecognize

                which is a pass, and which has innate logic for forming

                memset intrinsics out of loops. But we now also have a

                non-pass helper class to recognize a specific kind of

                loop. How would you envision this design scaling up to

                the next loop idiom?</div>

              <div><br>

              </div>

            </div>

          </div>

        </div>

      </div>

    </blockquote></div>

    Yes, I did. <br>

    It was a big class trying to recognize all interesting pattern. I

    don't want to make it become messy. <br>

    That is why I implement the popcount in a separate class. <br>

    In the future, I'd like new non-trivial recognization is also

    implemented in separate class as well. <br>

    That said, I don't like pull what we have in the pass to separate

    classes. <br>

    We have to make change one step at a time.<br></div></blockquote><div><br></div><div style>Yea, I agree we need some way to partition stuff up. I like pulling things out, but I think it would be useful to get some pattern in place across the loop idioms. I'm not yet sure what the best pattern is though, which is why I was asking your thoughts on the subject.</div>

</div></div></div>