<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 1/31/19 1:14 PM, Robin Kruppe wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAJrduR40SkYEXDsxX0+sKo0bfvS-j-o4=23UfvvaSWsMBvffKQ@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

      <div dir="ltr">

        <div dir="ltr"><br>

        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Thu, 31 Jan 2019 at

            20:17, Philip Reames via llvm-dev <<a

              href="mailto:llvm-dev@lists.llvm.org" target="_blank"

              moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex"><br>

            On 1/31/19 11:03 AM, David Greene wrote:<br>

            > Philip Reames <<a

              href="mailto:listmail@philipreames.com" target="_blank"

              moz-do-not-send="true">listmail@philipreames.com</a>>

            writes:<br>

            ><br>

            >> Question 1 - Why do we need separate mask and

            lengths? Can't the<br>

            >> length be easily folded into the mask operand?<br>

            >><br>

            >> e.g. newmask = (<4 x i1>)((i4)%y & ((1

            << %L) - 1))<br>

            >> and then pattern matched in the backend if needed<br>

            > I'm a little concerned about how difficult it will be

            to maintain enough<br>

            > information throughout compilation to be able to match

            this on a machine<br>

            > with an explicit vector length value.<br>

            Does the hardware *also* have a mask register?  If so, this

            is likely a <br>

            minor code quality issue which can be refined incrementally.

            If it <br>

            doesn't, then I can see your concern.<br>

          </blockquote>

          <div><br>

          </div>

          <div>Masking/predication is supported nearly universally, but

            I don't think the code quality issue is minor. It would be

            on a typical packed-SIMD machine with 128/256/512 bit

            registers, but the processors with a vector length register

            are usually built with much larger register files and

            without a corresponding increase in the number of functional

            units. For example, 4096 bits per vector register is really

            quite modest for this kind of machine, while the data path

            can reasonably be "only" 128 or 256 bits wide.<br>

          </div>

          <div><br>

          </div>

          <div>This changes the calculus quite a bit: vector lengths

            much shorter or minimally larger than one full register are

            suddenly reasonably common (in application code, not so much

            in HPC kernels) and because each vector instruction is split

            into many data-path-sized uops, it's trivial and very

            rewarding to cut processing short halfway through a vector.

            The efficiency of "short vector code" then depends on the

            ability to finish each operation on those short vectors

            relatively quickly rather than padding everything to a full

            vector register. <br>

          </div>

          <div><br>

          </div>

          <div>For example, if a loop with a trip count of 20 is

            vectorized on a machine with 64 elements per vector (that's

            64b elements in a 4096b register, so this is lowballing

            it!), using only masks and not the vector length register

            makes your vector unit do about three times more work than

            it would have to if you set the vector length register to

            20. That keeps the register file and functional units busy

            for no good reason. Some microarchitectures take on the

            burden of determining when a whole chunk of the vector is

            masked out and can then skip over it quickly, but many

            others don't. So you're likely burning a whole bunch of

            power and quite possibly taking up cycles that could be

            filled with useful work from other instructions instead.</div>

        </div>

      </div>

    </blockquote>

    <p>Thank you for the explanation.  <br>

    </p>
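    <p>To check my reading of the trip-count-20 example, here is a rough
      sketch of the arithmetic in Python.  It assumes one uop per
      data-path-sized chunk and no skipping of fully masked-out chunks;
      the helper name <code>uops_per_vector_op</code> is made up for
      illustration, and the 128/256 bit data-path widths are simply the
      numbers quoted above.</p>
    <pre>
```python
def uops_per_vector_op(active_elements, element_bits, datapath_bits):
    """Number of data-path-sized uops needed to process `active_elements` lanes."""
    elements_per_uop = datapath_bits // element_bits
    # Round up: a partially filled chunk still costs one uop.
    return -(-active_elements // elements_per_uop)


ELEMENT_BITS = 64      # 64b elements, as in the quoted example
REGISTER_BITS = 4096   # 4096b vector register
TRIP_COUNT = 20        # loop trip count from the quoted example

elements_per_register = REGISTER_BITS // ELEMENT_BITS

for datapath_bits in (128, 256):
    # Mask only: the whole register is processed (assumes no chunk skipping).
    mask_only = uops_per_vector_op(elements_per_register, ELEMENT_BITS, datapath_bits)
    # Vector length register set to the trip count: processing stops early.
    with_length = uops_per_vector_op(TRIP_COUNT, ELEMENT_BITS, datapath_bits)
    print(datapath_bits, mask_only, with_length, mask_only / with_length)
```
    </pre>
    <p>Under those assumptions the counts are 32 vs. 10 uops for a 128
      bit data path and 16 vs. 5 for a 256 bit one, a ratio of 3.2 in
      both cases, which is consistent with the "about three times more
      work" estimate.</p>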

    <p>Do such architectures frequently have arithmetic operations on

      the mask registers?  (i.e. can I reasonably compute a conservative

      length given a mask register value)  If I can, then having a mask

      as the canonical form and re-deriving the length register from a

      mask for a sequence of instructions which share a predicate seems

      fairly reasonable.  Note that I'm assuming this as a fallback, and

      that the common case is handled via the equivalent of

      ComputeKnownBits on the mask itself at compile time.  <br>

    </p>
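    <p>To make the fallback concrete, here is a minimal sketch in Python
      (not IR) of folding a length into a mask and then re-deriving a
      conservative length from the result.  The names
      <code>fold_length_into_mask</code> and
      <code>conservative_length</code> are made up for illustration, and
      the sketch assumes bit i of the mask corresponds to lane i.  A
      real target would need some find-last-set style operation on mask
      registers, which is exactly what I'm asking about.</p>
    <pre>
```python
def fold_length_into_mask(mask_bits, length):
    """newmask = mask & ((1 << length) - 1); bit i corresponds to lane i."""
    return mask_bits & ((1 << length) - 1)


def conservative_length(mask_bits):
    """Smallest length that still covers every active lane.

    This is the index of the highest set bit plus one, or 0 for an empty mask.
    """
    return mask_bits.bit_length()


mask = 0b10110111                         # hypothetical 8-lane predicate
folded = fold_length_into_mask(mask, 5)   # keep only lanes 0..4
recovered = conservative_length(folded)   # length re-derived from the mask alone
print(bin(folded), recovered)
```
    </pre>
    <p>Here the recovered length is 5.  If lane 4 had been inactive in
      the original mask, the recovered length would be shorter than 5,
      which is still correct because every lane at or beyond the
      recovered length is masked out anyway.</p>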

    <p>The only case that the combination of CKB and a dynamic

      mask->length fallback wouldn't handle reliably is a mask

      loaded from an external source (memory, function call

      boundary, etc.) followed by a short sequence of vector ops.  Are

      such cases really common enough that this needs to be a first-class

      element of the design?</p>

    <p><br>

    </p>

    <p>p.s. To make sure my tone is coming across correctly, let me

      spell out that I'm not convinced, but I'm not actively objecting. 

      I'm playing devil's advocate for the purposes of fleshing out a

      design, but if folks more knowledgeable than I strongly believe

      the right design requires both masks and lengths, I'm happy to

      defer on that point.  <br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:CAJrduR40SkYEXDsxX0+sKo0bfvS-j-o4=23UfvvaSWsMBvffKQ@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote">

          <div><br>

          </div>

          <div>Cheers,</div>

          <div>Robin<br>

          </div>

          <br>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">

            >> Question 2 - Have you explored using selects

            instead? What practical<br>

            >> problems do you run into which make you believe

            explicit predication<br>

            >> is required?<br>

            >><br>

            >> e.g. %sub = fsub <4 x float> %x, %y<br>

            >> %result = select <4 x i1> %M, <4 x

            float> %sub, undef<br>

            > That is semantically incorrect.  According to IR

            semantics, the fsub is<br>

            > fully evaluated before the select comes along.  It

            could trap for<br>

            > elements where %M is 0, whereas a masked intrinsic

            conveys the proper<br>

            > semantics of masking traps for masked-out elements.  We

            need intrinsics<br>

            > and eventually (IMHO) fully first-class predication to

            make this work<br>

            > properly.<br>

            <br>

            If you want specific trap behavior, you need to use the

            constrained <br>

            family of intrinsics instead.  In IR, fsub is expected not

            to trap.<br>

            <br>

            We have an existing solution for modeling FP environment

            aspects such as <br>

            rounding and trapping.  The proposed signatures for your EVL

            proposal do <br>

            not appear to subsume those, and you've not proposed their

            retirement.  <br>

            We definitely don't want *two* ways of describing FP

            trapping.<br>

            <br>

            In other words, I don't find this reason compelling since my

            example can <br>

            simply be rewritten using the appropriate constrained

            intrinsic.<br>

            <br>

            <br>

            ><br>

            >> My context for these questions is that my

            experience recently with our<br>

            >> existing masked intrinsics shows us missing fairly

            basic<br>

            >> optimizations, precisely because they weren't able

            to reuse all of the<br>

            >> existing infrastructure. (I've been working on<br>

            >> SimplifyDemandedVectorElts recently for exactly

            this reason.) My<br>

            >> concern is that your EVL proposal will end up in

            the same state.<br>

            > I think that's just the nature of the beast.  We need

            IR-level support<br>

            > for masking and we have to teach LLVM about it.<br>

            I'm solidly of the opinion that we already *have* IR support

            for <br>

            explicit masking in the form of gather/scatter/etc...  Until

            someone has <br>

            taken the effort to make masking in this context *actually

            work well*, <br>

            I'm unconvinced that we should greatly expand the usage in

            the IR.<br>

            ><br>

            >                             -David<br>


          </blockquote>

        </div>

      </div>

    </blockquote>
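    <p>For reference, the Question 2 disagreement quoted above can be
      illustrated with a small Python simulation (not LLVM IR).  It
      assumes a subtraction that traps on overflow, which plain IR
      <code>fsub</code> is not expected to do; the names
      <code>trapping_fsub</code>, <code>sub_then_select</code> and
      <code>masked_sub</code> are hypothetical, and <code>None</code>
      stands in for undef.</p>
    <pre>
```python
import math


def trapping_fsub(x, y):
    """Stand-in for an fsub that traps on overflow (an assumption of this sketch)."""
    r = x - y
    if math.isinf(r) and not (math.isinf(x) or math.isinf(y)):
        raise FloatingPointError("overflow")
    return r


def sub_then_select(xs, ys, mask):
    # Every lane is computed first; the select only picks results afterwards.
    full = [trapping_fsub(x, y) for x, y in zip(xs, ys)]
    return [r if m else None for r, m in zip(full, mask)]


def masked_sub(xs, ys, mask):
    # Masked-out lanes are never evaluated, so they cannot trap.
    return [trapping_fsub(x, y) if m else None for x, y, m in zip(xs, ys, mask)]


xs = [1.0, 2.0, 1e308, 4.0]
ys = [0.5, 0.5, -1e308, 0.5]
mask = [True, True, False, True]   # lane 2 would overflow, but is masked out

masked_result = masked_sub(xs, ys, mask)
try:
    sub_then_select(xs, ys, mask)
    select_trapped = False
except FloatingPointError:
    select_trapped = True
print(masked_result, select_trapped)
```
    </pre>
    <p>With a trapping subtraction the two forms differ only on the
      masked-out lane; with a non-trapping subtraction they produce the
      same values for every active lane, which is why the choice of FP
      environment model matters for this question.</p>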

  </body>

</html>