[llvm-dev] [RFC] Intrinsics for Hardware Loops

Thu May 30 10:54:50 PDT 2019

I'll just note that I'm generally very skeptical of the argument in
(2).  Not actively objecting, but every time this general line of
thought comes up, I find the reasoning unconvincing.


On 5/30/19 5:19 AM, Sam Parker wrote:
> Hi Philip,
>
> Yes, these constructs should really only be used by the compiler and
> probably always very late in the pipeline. To address your other points:
>
> 1) Agreed. loop.end has now been renamed to 'loop.decrement'. I've also
> added 'loop.decrement.reg' which operates upon the updated loop
> counter, instead of some opaque system register.
> 2) It could be handled by normal IR, the vectorizer currently splits
> out the equivalent when folding the epilogue into the loop body. The
> reason why we need an intrinsic is to work around the limitations of
> basic block isel. In our new architecture, the lane predication is
> implicit iff we can generate the hardware loop - but that doesn't
> prevent other instructions, predicated on something other than the
> loop index, from being generated too. At ISel we can't guarantee
> whether a predicate is loop index based or otherwise, so it has to be
> explicit coming into ISel.
> 3) The main difference here is the same as (2). As I understand it, SVE
> has a bank of predicate registers that are explicitly accessed, whereas
> MVE has a status register that is used implicitly.
>
> Sam Parker
>
> Compilation Tools Engineer | Arm
>
> . . . . . . . . . . . . . . . . . . . . . . . . . . .
>
> Arm.com
>
> ------------------------------------------------------------------------
> *From:* Philip Reames <listmail at philipreames.com>
> *Sent:* 28 May 2019 19:00
> *To:* Sam Parker; llvm-dev at lists.llvm.org
> *Cc:* nd
> *Subject:* Re: [llvm-dev] [RFC] Intrinsics for Hardware Loops
>  
>
> This seems like a generally reasonable approach.  I have some
> hesitation about the potential separation of the control flow and the
> intrinsics (i.e. can we ever confuse which loop they apply to?), but
> the basic notion seems reasonable.  Particularly so as Hal points out
> that we already have something like this in PPC.   I'd suggest framing
> this as being an IR assist to backends rather than a canonical form or
> anything expected to be used by frontends though.
>
>
> A couple of random comments; there's no coherent message here, just a
> collection of thoughts.
>
>
> 1) Your "loop.end" intrinsic is very confusingly named.  I think you
> definitely need a different name there.  Also, you fail to specify
> what the return value is.
>
> 2) Your get.active.mask.X is a generally useful construct, but I think
> it can be represented via bitmath and a bitcast, right?  (i.e. does it
> have to be an intrinsic?)
>
> 3) There seems to be a good amount of overlap with the SVE ideas.  I'm
> not suggesting it needs to be reconciled, just pointing out many of
> the issues are common.  (The more I see discussion of these topics,
> the more unsettled it all feels.  Trying out a couple of
> experimental designs, and iterating until one wins is feeling more and
> more like the right approach.)
>
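One reading of the "bitmath" suggestion in point 2 can be sketched in Python. This is a minimal model, not anything from the patch: the function names are hypothetical, and it assumes the mask is simply "lane index < remaining elements", packed into an integer that would then be bitcast to a vector of i1.

```python
from typing import List


def active_mask_bits(remaining: int, vf: int) -> int:
    """Integer whose low bits are set for the active lanes (assumed encoding)."""
    active = max(0, min(remaining, vf))  # clamp to the range [0, vf]
    return (1 << active) - 1             # e.g. 2 active lanes -> 0b0011


def active_mask_lanes(remaining: int, vf: int) -> List[bool]:
    """Per-lane form: lane i is active while i < remaining."""
    return [lane < remaining for lane in range(vf)]


def bits_to_lanes(bits: int, vf: int) -> List[bool]:
    """Stand-in for the bitcast from an integer to <vf x i1>."""
    return [bool((bits >> lane) & 1) for lane in range(vf)]
```

Under these assumptions the two forms agree for every remaining count, which is the sense in which the mask needs no intrinsic for its value alone; whether ISel can still recognise it as loop-index based is a separate question.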
>
> Philip
>
>
>
>
> On 5/20/19 4:00 AM, Sam Parker via llvm-dev wrote:
>> Hi,
>>
>> Arm have recently announced the v8.1-M architecture specification for
>> our next-generation microcontrollers. The architecture includes
>> vector extensions (MVE) and support for low-overhead branches (LoB),
>> which can be thought of as a style of hardware loop. Hardware loops
>> aren't new to LLVM; other backends (at least Hexagon and PPC that I
>> know of) also include support. These implementations insert the loop
>> controlling instructions at the MachineInstr level and I'd like to
>> propose that we add intrinsics to support this notion at the IR
>> level; primarily to be able to use scalar evolution to understand the
>> loops instead of having to implement a machine-level analysis for
>> each target.
>>
>> I've posted an RFC with a prototype implementation in
>> https://reviews.llvm.org/D62132. It contains intrinsics that are
>> currently Arm specific, but I hope they're general enough to be used
>> by all targets. The Armv8.1-M architecture supports do-while and
>> while loops, but for conciseness, here, I'd like to just focus on
>> while loops. There are two parts to this RFC: (1) the intrinsics
>> and (2) a prototype implementation in the Arm backend to enable
>> tail-predicated machine loops.
>>    
>> 1. LLVM IR Intrinsics
>>    
>> In the following definitions, I use the term 'element' to describe
>> the work performed by an IR loop that has not been vectorized or
>> unrolled by the compiler. This should be equivalent to the loop at
>> the source level.
>>    
>> void @llvm.arm.set.loop.iterations(i32)
>> - Takes a single operand: the number of iterations to be executed.
>>    
>> i32 @llvm.arm.set.loop.elements(i32, i32)
>> - Takes two operands:
>>   - The total number of elements to be processed by the loop.
>>   - The maximum number of elements processed in one iteration of
>>     the IR loop body.
>> - Returns the number of iterations to be executed.
>>    
>> <X x i1> @llvm.arm.get.active.mask.X(i32)
>> - Takes a single operand: the number of elements that still need
>>   processing.
>> - Where 'X' denotes the vectorization factor, returns a vector of i1
>>   indicating which vector lanes are active for the current loop
>>   iteration.
>>    
>> i32 @llvm.arm.loop.end(i32, i32)
>> - Takes two operands:
>>   - The number of elements that still need processing.
>>   - The maximum number of elements processed in one iteration of the
>>     IR loop body.
>>    
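The definitions above can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than the prototype's behaviour: the function names are hypothetical, element counts are assumed non-negative, the iteration count is assumed to round up, and the return value of loop.end (which the definition does not state) is inferred from how the usage example consumes it.

```python
from typing import List


def set_loop_elements(total_elements: int, max_per_iteration: int) -> int:
    """Model of set.loop.elements: iterations needed, rounding up (assumed)."""
    if max_per_iteration <= 0:
        raise ValueError("max_per_iteration must be positive")
    return (total_elements + max_per_iteration - 1) // max_per_iteration


def get_active_mask(remaining: int, vf: int) -> List[bool]:
    """Model of get.active.mask.X: lane i is active while i < remaining."""
    return [lane < remaining for lane in range(vf)]


def loop_end(remaining: int, max_per_iteration: int) -> int:
    """Model of loop.end: elements left after this iteration (assumed).

    The result may be zero or negative on the final iteration.
    """
    return remaining - max_per_iteration
```

For example, 10 elements at 4 per iteration give 3 iterations, and the last one runs with only the first two lanes active.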
>> The following gives an illustration of their intended usage:
>>    
>> entry:
>>   %0 = call i32 @llvm.arm.set.loop.elements(i32 %N, i32 4)
>>   %1 = icmp ne i32 %0, 0
>>   br i1 %1, label %vector.ph, label %for.loopexit
>>    
>> vector.ph:
>>   br label %vector.body
>>    
>> vector.body:
>>   %elts = phi i32 [ %N, %vector.ph ], [ %elts.rem, %vector.body ]
>>   %active = call <4 x i1> @llvm.arm.get.active.mask(i32 %elts, i32 4)
>>   %load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
>>       <4 x i32>* %addr, i32 4, <4 x i1> %active, <4 x i32> undef)
>>   tail call void @llvm.masked.store.v4i32.p0v4i32(
>>       <4 x i32> %load, <4 x i32>* %addr.1, i32 4, <4 x i1> %active)
>>   %elts.rem = call i32 @llvm.arm.loop.end(i32 %elts, i32 4)
>>   %cmp = icmp sgt i32 %elts.rem, 0
>>   br i1 %cmp, label %vector.body, label %for.loopexit
>>    
>> for.loopexit:
>>   ret void
>>    
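The control flow of the example can be emulated in Python to show how the epilogue folds into the predicated body. This is a rough sketch, not the prototype: `tail_predicated_copy` and the `base` index are hypothetical, the IR above leaves address updates out while this model advances them, and loop.end is assumed to return the remaining element count.

```python
from typing import List, Optional, Tuple


def tail_predicated_copy(
    src: List[int], vf: int = 4
) -> Tuple[List[Optional[int]], List[List[bool]]]:
    """Copy src lane by lane, recording the active mask of each iteration."""
    n = len(src)
    dst: List[Optional[int]] = [None] * n
    masks: List[List[bool]] = []

    # entry: set.loop.elements yields the iteration count (rounded up).
    iterations = (n + vf - 1) // vf
    if iterations == 0:
        return dst, masks  # branch straight to for.loopexit

    elts = n  # the %elts phi, starting at %N
    base = 0  # hypothetical address induction variable
    while True:
        # get.active.mask: lane i is active while i < elts.
        active = [lane < elts for lane in range(vf)]
        masks.append(active)
        # masked.load followed by masked.store, one lane at a time.
        for lane, on in enumerate(active):
            if on:
                dst[base + lane] = src[base + lane]
        base += vf
        elts -= vf  # loop.end: remaining elements (assumed return value)
        if not elts > 0:  # icmp sgt %elts.rem, 0
            break
    return dst, masks
```

With 10 elements and a vectorization factor of 4, the body runs three times and the final iteration copies only two lanes, so no scalar epilogue is needed.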
>> As the example shows, control-flow is still ultimately performed
>> through the icmp and br pair. There's nothing connecting the
>> intrinsics to a given loop or any requirement that a set.loop.* call
>> needs to be paired with a loop.end call.
>>    
>> 2. Low-overhead loops in the Arm backend
>>    
>> Disclaimer: The prototype is barebones and reuses parts of NEON and
>> I'm currently targeting the Cortex-A72 which does not support this
>> feature! opt and llc build and the provided test case doesn't cause a
>> crash...
>>    
>> The low-overhead branch extension can be combined with MVE to
>> generate vectorized loops in which the epilogue is executed within
>> the predicated vector body. The proposal is for this to be supported
>> through a series of passes:
>> 1) IR LoopPass to identify suitable loops and insert the intrinsics
>>    proposed above.
>> 2) DAGToDAG ISel, which maps the intrinsics, almost 1-1, to pseudo
>>    instructions.
>> 3) A final MachineFunctionPass to expand the pseudo instructions.
>>    
>> To help / enable the lowering of an i1 vector, the VPR register has
>> been added. This is a status register that contains the P0 predicate
>> and is also used to model the implicit predicates of tail-predicated
>> loops.
>>    
>> There are two main reasons why pseudo instructions are used instead
>> of generating MIs directly during ISel:
>> 1) They give us a chance to inspect the whole loop later and
>>    confirm that it's a good idea to generate such a loop. This is
>>    trivial for scalar loops, but not really applicable for
>>    tail-predicated loops.
>> 2) They allow us to separate the decrementing of the loop counter from
>>    the instruction that branches back, which should help us recover if
>>    LR gets spilt between these two pseudo ops.
>>    
>> For Armv8.1-M, the while.setup intrinsic is used to generate the wls
>> and wlstp instructions, while loop.end generates the le and letp
>> instructions. The active.mask can just be removed because the lane
>> predication is handled implicitly.
>>    
>> I'm not sure of the vectorizer's limitations in generating vector
>> instructions that operate across lanes, such as reductions, when
>> generating a predicated loop, but this needs to be considered.
>>
>> I'd welcome any feedback here or on Phabricator and I'd especially like
>> to know if this would be useful to current targets.
>>
>> cheers,
>>
>> Sam Parker