<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html;

      charset=windows-1252">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>I'll just note that I'm generally very skeptical of the argument

      in (2).  Not actively objecting, but every time this general line

      of thought comes up, I find the reasoning unconvincing.  <br>

    </p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 5/30/19 5:19 AM, Sam Parker wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:AM5PR0801MB19558C0A862CC1CC713F902F85180@AM5PR0801MB1955.eurprd08.prod.outlook.com">


      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

        Hi Philip,</div>

      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

        <br>

      </div>

      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

        Yes, these constructs should really only be used by the compiler

        and probably always very late in the pipeline. To address your

        other points:</div>

      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

        <br>

      </div>

      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

        1) Agreed. loop.end has now been renamed to 'loop.decrement'. I've

        also added 'loop.decrement.reg' which operates upon the updated

        loop counter, instead of some opaque system register.</div>

      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

        2) It could be handled by normal IR; the vectorizer currently

        spits out the equivalent when folding the epilogue into the

        loop body. The reason why we need an intrinsic is to work around

        the limitations of basic block isel. In our new architecture,

        the lane predication is implicit iff we can generate the

        hardware loop - but that doesn't prevent other instructions,

        predicated on something other than the loop index, from being

        generated too. At ISel <span style="color: rgb(0, 0, 0);

          font-family: Calibri, Arial, Helvetica, sans-serif; font-size:

          12pt;">we can't guarantee whether a predicate is loop index

          based or otherwise, so it has to be explicit coming into ISel.</span></div>

      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

        <span style="color: rgb(0, 0, 0); font-family: Calibri, Arial,

          Helvetica, sans-serif; font-size: 12pt;">3) The main

          difference here is the same as (2). As I understand it, SVE has

          a bank of predicate registers that are explicitly accessed,

          whereas MVE has a status register that is used implicitly.</span></div>

      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

        <br>

      </div>

      <div id="Signature">

        <div id="divtagdefaultwrapper" dir="ltr" style="font-size:12pt;

          color:rgb(0,0,0); background-color:rgb(255,255,255);

          font-family:Calibri,Arial,Helvetica,sans-serif,EmojiFont,"Apple

          Color Emoji","Segoe UI

          Emoji",NotoColorEmoji,"Segoe UI

          Symbol","Android

          Emoji",EmojiSymbols,EmojiFont,"Apple Color

          Emoji","Segoe UI

          Emoji",NotoColorEmoji,"Segoe UI

          Symbol","Android Emoji",EmojiSymbols">

          <p style="margin-top: 0px; margin-bottom:

            0px;font-family:"Times New Roman""><span

              style="font-family:Calibri,Helvetica,sans-serif">Sam

              Parker</span></p>

          <span style="font-family:Calibri,Helvetica,sans-serif"></span>

          <p style="margin-top: 0px; margin-bottom:

            0px;font-family:"Times New Roman""><span

              style="font-family:Calibri,Helvetica,sans-serif">Compilation

              Tools Engineer | Arm</span></p>

          <span style="font-family:Calibri,Helvetica,sans-serif"></span>

          <p style="margin-top: 0px; margin-bottom:

            0px;font-family:"Times New Roman""><span

              style="font-family:Calibri,Helvetica,sans-serif">. . . . .

              . . . . . . . . . . . . . . . . . . . . . .</span></p>

          <span style="font-family:Calibri,Helvetica,sans-serif"></span>

          <p style="margin-top: 0px; margin-bottom:

            0px;font-family:"Times New Roman""><span

              style="font-family:Calibri,Helvetica,sans-serif">Arm.com</span></p>

        </div>

      </div>

      <hr style="display:inline-block;width:98%" tabindex="-1">

      <div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt"

          face="Calibri, sans-serif" color="#000000"><b>From:</b> Philip

          Reames <a class="moz-txt-link-rfc2396E" href="mailto:listmail@philipreames.com">&lt;listmail@philipreames.com&gt;</a><br>

          <b>Sent:</b> 28 May 2019 19:00<br>

          <b>To:</b> Sam Parker; <a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

          <b>Cc:</b> nd<br>

          <b>Subject:</b> Re: [llvm-dev] [RFC] Intrinsics for Hardware

          Loops</font>

        <div> </div>

      </div>

      <div style="background-color:#FFFFFF">

        <p>This seems like a generally reasonable approach.  I have some

          hesitation about the potential separation of the control flow

          and the intrinsics (i.e. can we ever confuse which loop they

          apply to?), but the basic notion seems reasonable. 

          Particularly so as Hal points out that we already have

          something like this in PPC.   I'd suggest framing this as

          being an IR assist to backends rather than a canonical form or

          anything expected to be used by frontends though.<br>

        </p>

        <p><br>

        </p>

        <p>A couple of random comments; there's no coherent message

          here, just a collection of thoughts.</p>

        <p><br>

        </p>

        <p>1) Your "loop.end" intrinsic is very confusingly named.  I

          think you definitely need a different name there.  Also, you

          fail to specify what the return value is.</p>

        <p>2) Your get.active.mask.X is a generally useful construct,

          but I think it can be represented via bitmath and a bitcast

          right?  (i.e. does it have to be an intrinsic?)</p>
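        <p>A minimal sketch of the "bitmath and a bitcast" idea in (2),
          written as a Python model rather than IR. The helper names
          <code>active_mask_bits</code> and <code>bits_to_lanes</code> are
          hypothetical, and the lane ordering (lane 0 in the lowest bit) is
          an assumption, not something the RFC specifies.</p>
<pre>
```python
def active_mask_bits(remaining: int, vf: int) -> int:
    """Integer with the low min(remaining, vf) bits set (hypothetical helper)."""
    active = max(0, min(remaining, vf))
    return (1 << active) - 1


def bits_to_lanes(bits: int, vf: int) -> list:
    """Model of bitcasting the integer to vf one-bit lanes, lane 0 first."""
    return [bool((bits >> lane) & 1) for lane in range(vf)]


if __name__ == "__main__":
    # Three elements left with a vectorization factor of four.
    print(bits_to_lanes(active_mask_bits(3, 4), 4))
```
</pre>
        <p>Whether a backend can reliably recognise this pattern again
          after instruction selection is a separate question that the
          sketch does not address.</p>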

        <p>3) There seems to be a good amount of overlap with the SVE

          ideas.  I'm not suggesting it needs to be reconciled, just

          pointing out many of the issues are common.  (The more I see

          discussion of these topics, the more unsettled it all

          feels.  Trying out a couple of experimental designs, and

          iterating until one wins is feeling more and more like the

          right approach.)</p>

        <p><br>

        </p>

        <p>Philip<br>

        </p>

        <p><br>

        </p>

        <p><br>

        </p>

        <p><br>

        </p>

        <div class="x_moz-cite-prefix">On 5/20/19 4:00 AM, Sam Parker

          via llvm-dev wrote:<br>

        </div>

        <blockquote type="cite">


          <div style="font-family:Calibri,Arial,Helvetica,sans-serif;

            font-size:12pt; color:rgb(0,0,0)">

            <span>Hi,<br>

            </span>

            <div><br>

            </div>

            <div>Arm have recently announced the v8.1-M architecture

              specification for</div>

            <div>our  next generation microcontrollers. The architecture

              includes<br>

            </div>

            <div>vector extensions (MVE) and support for low-overhead

              branches (LoB),<br>

            </div>

            <div>which can be thought of as a style of hardware loop.

              Hardware loops<br>

            </div>

            <div>aren't new to LLVM, other backends (at least Hexagon

              and PPC that I<br>

            </div>

            <div>know of) also include support. These implementations

              insert the loop<br>

            </div>

            <div>controlling instructions at the MachineInstr level and

              I'd like to<br>

            </div>

            <div>propose that we add intrinsics to support this notion

              at the IR<br>

            </div>

            <div>level; primarily to be able to use scalar evolution to

              understand the<br>

            </div>

            <div>loops instead of having to implement a machine-level

              analysis for<br>

            </div>

            <div>each target.<br>

            </div>

            <div><br>

            </div>

            <div>I've posted an RFC with a prototype implementation in<br>

            </div>

            <div><a class="x_moz-txt-link-freetext"

                href="https://reviews.llvm.org/D62132"

                moz-do-not-send="true">https://reviews.llvm.org/D62132</a>.

              It contains intrinsics that are<br>

            </div>

            <div>currently Arm specific, but I hope they're general

              enough to be used<br>

            </div>

            <div>by all targets. The Arm v8.1-m architecture supports

              do-while and<br>

            </div>

            <div>while loops, but for conciseness, here, I'd like to

              just focus on<br>

            </div>

            <div>while loops. There are two parts to this RFC: (1) the

              intrinsics<br>

            </div>

            <div>and (2) a prototype implementation in the Arm backend

              to enable<br>

            </div>

            <div>tail-predicated machine loops.<br>

            </div>

            <div>    <br>

            </div>

            <div>1. LLVM IR Intrinsics<br>

            </div>

            <div>    <br>

            </div>

            <div>In the following definitions, I use the term 'element'

              to describe<br>

            </div>

            <div>the work performed by an IR loop that has not been

              vectorized or<br>

            </div>

            <div>unrolled by the compiler. This should be equivalent to

              the loop at<br>

            </div>

            <div>the source level.<br>

            </div>

            <div>    <br>

            </div>

            <div>void @llvm.arm.set.loop.iterations(i32)<br>

            </div>

            <div>- Takes as a single operand, the number of iterations

              to be executed.<br>

            </div>

            <div>    <br>

            </div>

            <div>i32 @llvm.arm.set.loop.elements(i32, i32)<br>

            </div>

            <div>- Takes two operands:<br>

            </div>

            <div>  - The total number of elements to be processed by the

              loop.<br>

            </div>

            <div>  - The maximum number of elements processed in one

              iteration of<br>

            </div>

            <div>    the IR loop body.<br>

            </div>

            <div>- Returns the number of iterations to be executed.<br>

            </div>

            <div>    <br>

            </div>
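            <div>As a rough illustration of the value set.loop.elements
              might return, the Python model below assumes the final
              iteration may be partial (tail-predicated), so the count is a
              ceiling division. The function name and the rounding rule are
              assumptions made for this sketch; the RFC text only says the
              intrinsic returns the number of iterations.<br>
            </div>
<pre>
```python
def set_loop_elements(total_elements: int, max_per_iteration: int) -> int:
    """Hypothetical model: iterations needed when the last one may be partial."""
    if max_per_iteration <= 0:
        raise ValueError("max_per_iteration must be positive")
    # Ceiling division: a partial final iteration still counts as one.
    return (total_elements + max_per_iteration - 1) // max_per_iteration


if __name__ == "__main__":
    # Ten elements, four per vector iteration.
    print(set_loop_elements(10, 4))
```
</pre>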

            <div>&lt;X x i1&gt; @llvm.arm.get.active.mask.X(i32)<br>

            </div>

            <div>- Takes as an operand, the number of elements that

              still need<br>

            </div>

            <div>  processing.<br>

            </div>

            <div>- Where 'X' denotes the vectorization factor, returns

              an array of i1<br>

            </div>

            <div>  indicating which vector lanes are active for the

              current loop<br>

            </div>

            <div>  iteration.<br>

            </div>

            <div>    <br>

            </div>

            <div>i32 @llvm.arm.loop.end(i32, i32)<br>

            </div>

            <div>- Takes two operands:<br>

            </div>

            <div>  - The number of elements that still need processing.<br>

            </div>

            <div>  - The maximum number of elements processed in one

              iteration of the<br>

            </div>

            <div>    IR loop body.<br>

            </div>

            <div>    <br>

            </div>

            <div>The following gives an illustration of their intended

              usage:<br>

            </div>

            <div>    <br>

            </div>

            <div>entry:<br>

            </div>

            <div>  %0 = call i32 @llvm.arm.set.loop.elements(i32 %N, i32

              4)<br>

            </div>

            <div>  %1 = icmp ne i32 %0, 0<br>

            </div>

            <div>  br i1 %1, label %vector.ph, label %for.loopexit<br>

            </div>

            <div>    <br>

            </div>

            <div>vector.ph:<br>

            </div>

            <div>  br label %vector.body<br>

            </div>

            <div>    <br>

            </div>

            <div>vector.body:<br>

            </div>

            <div>  %elts = phi i32 [ %N, %vector.ph ], [ %elts.rem,

              %vector.body ]<br>

            </div>

            <div>  %active = call <4 x i1>

              @llvm.arm.get.active.mask(i32 %elts, i32 4)<br>

            </div>

            <div>  %load = tail call <4 x i32>

              @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr,

              i32 4, <4 x i1> %active, <4 x i32> undef)<br>

            </div>

            <div>  tail call void @llvm.masked.store.v4i32.p0v4i32(<4

              x i32> %load, <4 x i32>* %addr.1, i32 4, <4 x

              i1> %active)<br>

            </div>

            <div>  %elts.rem = call i32 @llvm.arm.loop.end(i32 %elts,

              i32 4)<br>

            </div>

            <div>  %cmp = icmp sgt i32 %elts.rem, 0<br>

            </div>

            <div>  br i1 %cmp, label %vector.body, label %for.loopexit<br>

            </div>

            <div>    <br>

            </div>

            <div>for.loopexit:<br>

            </div>

            <div>  ret void<br>

            </div>

            <div>    <br>

            </div>
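            <div>The control flow of the example can be sketched as a
              small Python model. This is only an illustration under
              stated assumptions: loop.end is taken to return the
              remaining element count minus the vector width (as the icmp
              sgt in the example suggests), a lane is active when its
              index is below the remaining count, and the addresses
              advance by the vector width each iteration (the IR above
              elides the address update). All function names are
              hypothetical.<br>
            </div>
<pre>
```python
VF = 4  # vectorization factor used in the IR example


def get_active_mask(elts, vf=VF):
    """Assumed semantics: lane i is active while i is below the remaining count."""
    return [lane < elts for lane in range(vf)]


def loop_end(elts, vf=VF):
    """Assumed return value: elements still left after this iteration."""
    return elts - vf


def masked_copy(src, dst, n, vf=VF):
    """Copy n elements from src to dst; returns the number of iterations run."""
    iterations = 0
    if (n + vf - 1) // vf == 0:       # entry: iteration count is zero
        return iterations
    elts, base = n, 0
    while True:                        # vector.body
        active = get_active_mask(elts, vf)
        for lane, on in enumerate(active):
            if on:                     # masked load followed by masked store
                dst[base + lane] = src[base + lane]
        base += vf                     # assumption: addresses advance by vf
        iterations += 1
        elts = loop_end(elts, vf)      # %elts.rem
        if not elts > 0:               # icmp sgt i32 %elts.rem, 0
            return iterations


if __name__ == "__main__":
    source = list(range(10))
    target = [-1] * 12
    print(masked_copy(source, target, 10), target)
```
</pre>
            <div>In this model the final iteration runs with only the low
              lanes active, so no separate scalar epilogue is needed and
              elements past the requested count are left untouched.<br>
            </div>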

            <div>As the example shows, control-flow is still ultimately

              performed<br>

            </div>

            <div>through the icmp and br pair. There's nothing

              connecting the<br>

            </div>

            <div>intrinsics to a given loop or any requirement that a

              set.loop.* call<br>

            </div>

            <div>needs to be paired with a loop.end call.<br>

            </div>

            <div>    <br>

            </div>

            <div>2. Low-overhead loops in the Arm backend<br>

            </div>

            <div>    <br>

            </div>

            <div>Disclaimer: The prototype is barebones and reuses parts

              of NEON and<br>

            </div>

            <div>I'm currently targeting the Cortex-A72 which does not

              support this<br>

            </div>

            <div>feature! opt and llc build and the provided test case

              doesn't cause a<br>

            </div>

            <div>crash...<br>

            </div>

            <div>    <br>

            </div>

            <div>The low-overhead branch extension can be combined with

              MVE to<br>

            </div>

            <div>generate vectorized loops in which the epilogue is

              executed within<br>

            </div>

            <div>the predicated vector body. The proposal is for this to

              be supported<br>

            </div>

            <div>through a series of passes:<br>

            </div>

            <div>1) IR LoopPass to identify suitable loops and insert

              the intrinsics<br>

            </div>

            <div>   proposed above.<br>

            </div>

            <div>2) DAGToDAG ISel which maps the intrinsics, almost

              1-1, to a pseudo<br>

            </div>

            <div>   instruction.<br>

            </div>

            <div>3) A final MachineFunctionPass to expand the pseudo

              instructions.<br>

            </div>

            <div>    <br>

            </div>

            <div>To help / enable the lowering of an i1 vector, the

              VPR register has<br>

            </div>

            <div>been added. This is a status register that contains the

              P0 predicate<br>

            </div>

            <div>and is also used to model the implicit predicates of

              tail-predicated<br>

            </div>

            <div>loops.<br>

            </div>

            <div>    <br>

            </div>

            <div>There are two main reasons why pseudo instructions are

              used instead<br>

            </div>

            <div>of generating MIs directly during ISel:<br>

            </div>

            <div>1) They give us a chance to inspect the whole

              loop later and<br>

            </div>

            <div>   confirm that it's a good idea to generate such a

              loop. This is<br>

            </div>

            <div>   trivial for scalar loops, but not really applicable

              for<br>

            </div>

            <div>   tail-predicated loops.<br>

            </div>

            <div>2) It allows us to separate the decrementing of the

              loop counter from<br>

            </div>

            <div>   the instruction that branches back, which should

              help us recover if<br>

            </div>

            <div>   LR gets spilt between these two pseudo ops.<br>

            </div>

            <div>    <br>

            </div>

            <div>For Armv8.1-M, the while.setup intrinsic is used to

              generate the wls<br>

            </div>

            <div>and wlstp instructions, while loop.end generates the le

              and letp<br>

            </div>

            <div>instructions. The active.mask can just be removed

              because the lane<br>

            </div>

            <div>predication is handled implicitly.<br>

            </div>

            <div>    <br>

            </div>

            <div>I'm not sure of the vectorizer's limitations in

              generating vector<br>

            </div>

            <div>instructions that operate across lanes, such as

              reductions, when<br>

            </div>

            <span>generating a predicated loop but this needs to be

              considered.</span><br>

          </div>

          <div style="font-family:Calibri,Arial,Helvetica,sans-serif;

            font-size:12pt; color:rgb(0,0,0)">

            <span><br>

            </span></div>

          <div style="font-family:Calibri,Arial,Helvetica,sans-serif;

            font-size:12pt; color:rgb(0,0,0)">

            <span>I'd welcome any feedback here or on Phabricator and

              I'd especially like</span></div>

          <div style="font-family:Calibri,Arial,Helvetica,sans-serif;

            font-size:12pt; color:rgb(0,0,0)">

            <span>to know if this would be useful to current targets.</span></div>

          <div style="font-family:Calibri,Arial,Helvetica,sans-serif;

            font-size:12pt; color:rgb(0,0,0)">

            <span><br>

            </span></div>

          <div style="font-family:Calibri,Arial,Helvetica,sans-serif;

            font-size:12pt; color:rgb(0,0,0)">

            <span>cheers,</span></div>

          <div style="font-family:Calibri,Arial,Helvetica,sans-serif;

            font-size:12pt; color:rgb(0,0,0)">

            <br>

          </div>


          <br>


        </blockquote>

      </div>

    </blockquote>

  </body>

</html>