<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

</head>

<body bgcolor="#FFFFFF" text="#000000">


<div class="moz-cite-prefix">On 5/20/19 6:00 AM, Sam Parker via llvm-dev wrote:<br>

</div>

<blockquote type="cite" cite="mid:AM5PR0801MB1955F39A0D1BA4070CDB24A885060@AM5PR0801MB1955.eurprd08.prod.outlook.com">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<span>Hi,<br>

</span>

<div><br>

</div>

<div>Arm have recently announced the v8.1-M architecture specification for</div>

<div>our  next generation microcontrollers. The architecture includes<br>

</div>

<div>vector extensions (MVE) and support for low-overhead branches (LoB),<br>

</div>

<div>which can be thought of as a style of hardware loop. Hardware loops<br>

</div>

<div>aren't new to LLVM; other backends (at least Hexagon and PPC that I<br>

</div>

<div>know of) also include support. These implementations insert the loop<br>

</div>

<div>controlling instructions at the MachineInstr level and I'd like to<br>

</div>

<div>propose that we add intrinsics to support this notion at the IR<br>

</div>

<div>level;</div>

</div>

</blockquote>

<p><br>

</p>


<p>The PPC implementation also recognizes loops at the IR level (in lib/Target/PowerPC/PPCCTRLoops.cpp) and then matches the relevant combinations of intrinsics and conditional branches during SDAG ISel. The intrinsics that PPC uses are:</p>

<p></p>

<blockquote type="cite">  def int_ppc_mtctr : Intrinsic<[], [llvm_anyint_ty], []>;<br>

  def int_ppc_is_decremented_ctr_nonzero :<br>

    Intrinsic<[llvm_i1_ty], [], [IntrNoDuplicate]>;</blockquote>

<br>

This proposal actually sounds very similar to what PPC currently does for counter-based loops. This solution tends to work well, in part because we can use SCEV to analyze loops at the IR level and generate trip-count expressions.
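<p>As a rough sketch of the semantics the two PPC intrinsics model (not the backend's actual code), the following plain C++ treats the count register as a variable. The names <code>CountRegister</code>, <code>isDecrementedCtrNonzero</code> and <code>countedLoop</code> are hypothetical, and the sketch assumes a do-while loop shape with a trip count of at least one.</p>

<pre>
```cpp
// Hypothetical software model of a PPC-style count register (CTR).
// The member names mirror the intrinsics but are illustrative only.
struct CountRegister {
  unsigned long long value = 0;

  // Models int_ppc_mtctr: load the trip count into the counter.
  void mtctr(unsigned long long tripCount) { value = tripCount; }

  // Models int_ppc_is_decremented_ctr_nonzero: decrement, then test.
  bool isDecrementedCtrNonzero() { return --value != 0; }
};

// Runs a do-while style loop with the given trip count (assumed >= 1)
// and returns how many times the body executed.
unsigned countedLoop(unsigned long long tripCount) {
  CountRegister ctr;
  ctr.mtctr(tripCount);      // set up once, before entering the loop
  unsigned executed = 0;
  do {
    ++executed;              // stand-in for the loop body
  } while (ctr.isDecrementedCtrNonzero());
  return executed;
}
```
</pre>

<p>The trip count passed to <code>mtctr</code> is the value that SCEV would supply at the IR level; the latch only ever asks whether the decremented counter is still nonzero.</p>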

<p><br>

</p>

<blockquote type="cite" cite="mid:AM5PR0801MB1955F39A0D1BA4070CDB24A885060@AM5PR0801MB1955.eurprd08.prod.outlook.com">

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<div>primarily to be able to use scalar evolution to understand the<br>

</div>

<div>loops instead of having to implement a machine-level analysis for<br>

</div>

<div>each target.<br>

</div>

<div><br>

</div>

<div>I've posted an RFC with a prototype implementation in<br>

</div>

<div><a class="moz-txt-link-freetext" href="https://reviews.llvm.org/D62132">https://reviews.llvm.org/D62132</a>. It contains intrinsics that are<br>

</div>

<div>currently Arm specific, but I hope they're general enough to be used<br>

</div>

<div>by all targets. The Armv8.1-M architecture supports do-while and<br>

</div>

<div>while loops, but for conciseness, here, I'd like to just focus on<br>

</div>

<div>while loops. There are two parts to this RFC: (1) the intrinsics<br>

</div>

<div>and (2) a prototype implementation in the Arm backend to enable<br>

</div>

<div>tail-predicated machine loops.<br>

</div>

<div>    <br>

</div>

<div>1. LLVM IR Intrinsics<br>

</div>

<div>    <br>

</div>

<div>In the following definitions, I use the term 'element' to describe<br>

</div>

<div>the work performed by an IR loop that has not been vectorized or<br>

</div>

<div>unrolled by the compiler. This should be equivalent to the loop at<br>

</div>

<div>the source level.<br>

</div>

<div>    <br>

</div>

<div>void @llvm.arm.set.loop.iterations(i32)<br>

</div>

<div>- Takes a single operand: the number of iterations to be executed.<br>

</div>

<div>    <br>

</div>

<div>i32 @llvm.arm.set.loop.elements(i32, i32)<br>

</div>

<div>- Takes two operands:<br>

</div>

<div>  - The total number of elements to be processed by the loop.<br>

</div>

<div>  - The maximum number of elements processed in one iteration of<br>

</div>

<div>    the IR loop body.<br>

</div>

<div>- Returns the number of iterations to be executed.<br>

</div>

<div>    <br>

</div>

<div>&lt;X x i1&gt; @llvm.arm.get.active.mask.X(i32)<br>

</div>

<div>- Takes as an operand, the number of elements that still need<br>

</div>

<div>  processing.<br>

</div>

<div>- Where 'X' denotes the vectorization factor, returns an array of i1<br>

</div>

<div>  indicating which vector lanes are active for the current loop<br>

</div>

<div>  iteration.<br>

</div>

<div>    <br>

</div>

<div>i32 @llvm.arm.loop.end(i32, i32)<br>

</div>

<div>- Takes two operands:<br>

</div>

<div>  - The number of elements that still need processing.<br>

</div>

<div>  - The maximum number of elements processed in one iteration of the<br>

</div>

<div>    IR loop body.<br>

</div>

<div>    <br>

</div>

<div>The following gives an illustration of their intended usage:<br>

</div>

<div>    <br>

</div>

<div>entry:<br>

</div>

<div>  %0 = call i32 @llvm.arm.set.loop.elements(i32 %N, i32 4)<br>

</div>

<div>  %1 = icmp ne i32 %0, 0<br>

</div>

<div>  br i1 %1, label %vector.ph, label %for.loopexit<br>

</div>

<div>    <br>

</div>

<div>vector.ph:<br>

</div>

<div>  br label %vector.body<br>

</div>

<div>    <br>

</div>

<div>vector.body:<br>

</div>

<div>  %elts = phi i32 [ %N, %vector.ph ], [ %elts.rem, %vector.body ]<br>

</div>

<div>  %active = call <4 x i1> @llvm.arm.get.active.mask(i32 %elts, i32 4)<br>

</div>

<div>  %load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr, i32 4, <4 x i1> %active, <4 x i32> undef)<br>

</div>

<div>  tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %load, <4 x i32>* %addr.1, i32 4, <4 x i1> %active)<br>

</div>

<div>  %elts.rem = call i32 @llvm.arm.loop.end(i32 %elts, i32 4)<br>

</div>

<div>  %cmp = icmp sgt i32 %elts.rem, 0<br>

</div>

<div>  br i1 %cmp, label %vector.body, label %for.loopexit<br>

</div>

<div>    <br>

</div>

<div>for.loopexit:<br>

</div>

<div>  ret void<br>

</div>

<div>    <br>

</div>

<div>As the example shows, control-flow is still ultimately performed<br>

</div>

<div>through the icmp and br pair. There's nothing connecting the<br>

</div>

<div>intrinsics to a given loop or any requirement that a set.loop.* call<br>

</div>

<div>needs to be paired with a loop.end call.<br>

</div>
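<p>The example above can be modelled in scalar C++ as a minimal sketch. All function names are hypothetical stand-ins for the intrinsics. It assumes that loop.end returns the number of elements left after the current iteration (the definition above does not state the return value) and, unlike the IR snippet, it advances the addresses on each iteration.</p>

<pre>
```cpp
// Hypothetical scalar model of the proposed intrinsics; VF is the
// vectorization factor (4 in the IR example).
const int VF = 4;

// Models set.loop.elements: iterations = ceil(elements / maxPerIter).
int setLoopElements(int elements, int maxPerIter) {
  return (elements + maxPerIter - 1) / maxPerIter;
}

// Models get.active.mask: a lane is active while its index is below
// the number of elements that still need processing.
void getActiveMask(int remaining, bool mask[VF]) {
  for (int lane = 0; lane != VF; ++lane)
    mask[lane] = remaining > lane;
}

// Models loop.end (assumption: returns the elements left afterwards).
int loopEnd(int remaining, int maxPerIter) { return remaining - maxPerIter; }

// Copies n ints from src to dst with a tail-predicated loop and
// returns the number of iterations executed.
int predicatedCopy(const int *src, int *dst, int n) {
  int iterations = 0;
  if (setLoopElements(n, VF) == 0)
    return iterations;               // zero-trip guard, as in the entry block
  int elts = n;
  int base = 0;
  do {
    bool active[VF];
    getActiveMask(elts, active);
    for (int lane = 0; lane != VF; ++lane)
      if (active[lane])              // masked load followed by masked store
        dst[base + lane] = src[base + lane];
    base += VF;
    ++iterations;
    elts = loopEnd(elts, VF);
  } while (elts > 0);                // icmp sgt i32 %elts.rem, 0
  return iterations;
}
```
</pre>

<p>With six elements and a vectorization factor of four, the second iteration runs with only two active lanes, so no separate scalar epilogue is needed.</p>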

<div>    <br>

</div>

<div>2. Low-overhead loops in the Arm backend<br>

</div>

<div>    <br>

</div>

<div>Disclaimer: The prototype is barebones and reuses parts of NEON and<br>

</div>

<div>I'm currently targeting the Cortex-A72 which does not support this<br>

</div>

<div>feature! opt and llc build and the provided test case doesn't cause a<br>

</div>

<div>crash...<br>

</div>

<div>    <br>

</div>

<div>The low-overhead branch extension can be combined with MVE to<br>

</div>

<div>generate vectorized loops in which the epilogue is executed within<br>

</div>

<div>the predicated vector body. The proposal is for this to be supported<br>

</div>

<div>through a series of passes:<br>

</div>

<div>1) IR LoopPass to identify suitable loops and insert the intrinsics<br>

</div>

<div>   proposed above.<br>

</div>

<div>2) DAGToDAG ISel which maps each intrinsic, almost 1-1, to a pseudo<br>

</div>

<div>   instruction.<br>

</div>

<div>3) A final MachineFunctionPass to expand the pseudo instructions.<br>

</div>

<div>    <br>

</div>

<div>To help / enable the lowering of an i1 vector, the VPR register has<br>

</div>

<div>been added. This is a status register that contains the P0 predicate<br>

</div>

<div>and is also used to model the implicit predicates of tail-predicated<br>

</div>

<div>loops.<br>

</div>

<div>    <br>

</div>

<div>There are two main reasons why pseudo instructions are used instead<br>

</div>

<div>of generating MIs directly during ISel:<br>

</div>

<div>1) They give us a chance to later inspect the whole loop and<br>

</div>

<div>   confirm that it's a good idea to generate such a loop. This is<br>

</div>

<div>   trivial for scalar loops, but not really applicable for<br>

</div>

<div>   tail-predicated loops.<br>

</div>

</div>

</blockquote>

<p><br>

</p>

<p>Is the idea that you'll be able to fall back to using regular branch instructions for generating the loops? Are you doing this before or after register allocation?</p>

<p><br>

</p>

<p> -Hal</p>

<p><br>

</p>

<blockquote type="cite" cite="mid:AM5PR0801MB1955F39A0D1BA4070CDB24A885060@AM5PR0801MB1955.eurprd08.prod.outlook.com">

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<div></div>

<div>2) It allows us to separate the decrementing of the loop counter from<br>

</div>

<div>   the instruction that branches back, which should help us recover if<br>

</div>

<div>   LR gets spilt between these two pseudo ops.<br>

</div>

<div>    <br>

</div>

<div>For Armv8.1-M, the while.setup intrinsic is used to generate the wls<br>

</div>

<div>and wlstp instructions, while loop.end generates the le and letp<br>

</div>

<div>instructions. The active.mask can just be removed because the lane<br>

</div>

<div>predication is handled implicitly.<br>

</div>

<div>    <br>

</div>

<div>I'm not sure of the vectorizer's limitations in generating vector<br>

</div>

<div>instructions that operate across lanes, such as reductions, when<br>

</div>

<span>generating a predicated loop but this needs to be considered.</span><br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<span><br>

</span></div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<span>I'd welcome any feedback here or on Phabricator and I'd especially like</span></div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<span>to know if this would be useful to current targets.</span></div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<span><br>

</span></div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<span>cheers,</span></div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif;

        font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div id="Signature">

<div id="divtagdefaultwrapper" dir="ltr" style="font-size:12pt;

          color:rgb(0,0,0); background-color:rgb(255,255,255);

          font-family:Calibri,Arial,Helvetica,sans-serif,EmojiFont,"Apple

          Color Emoji","Segoe UI

          Emoji",NotoColorEmoji,"Segoe UI

          Symbol","Android

          Emoji",EmojiSymbols,EmojiFont,"Apple Color

          Emoji","Segoe UI

          Emoji",NotoColorEmoji,"Segoe UI

          Symbol","Android Emoji",EmojiSymbols">

<p style="margin-top: 0px; margin-bottom:

            0px;font-family:"Times New Roman"">

<span style="font-family:Calibri,Helvetica,sans-serif">Sam Parker</span></p>

<span style="font-family:Calibri,Helvetica,sans-serif"></span>

<p style="margin-top: 0px; margin-bottom:

            0px;font-family:"Times New Roman"">

<span style="font-family:Calibri,Helvetica,sans-serif">Compilation Tools Engineer | Arm</span></p>

<span style="font-family:Calibri,Helvetica,sans-serif"></span>

<p style="margin-top: 0px; margin-bottom:

            0px;font-family:"Times New Roman"">

<span style="font-family:Calibri,Helvetica,sans-serif">. . . . . . . . . . . . . . . . . . . . . . . . . . .</span></p>

<span style="font-family:Calibri,Helvetica,sans-serif"></span>

<p style="margin-top: 0px; margin-bottom:

            0px;font-family:"Times New Roman"">

<span style="font-family:Calibri,Helvetica,sans-serif">Arm.com</span></p>

</div>

</div>


</blockquote>

<pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

</body>

</html>