[llvm-dev] [RFC] Intrinsics for Hardware Loops
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Tue May 28 11:00:44 PDT 2019
This seems like a generally reasonable approach. I have some hesitation
about the potential separation of the control flow and the intrinsics
(i.e. can we ever confuse which loop they apply to?), but the basic
notion seems reasonable. Particularly so as Hal points out that we
already have something like this in PPC. I'd suggest framing this as
being an IR assist to backends rather than a canonical form or anything
expected to be used by frontends though.
A couple of random comments; there's no coherent message here, just a
collection of thoughts.
1) Your "loop.end" intrinsic is very confusingly named. I think you
definitely need a different name there. Also, you fail to specify
what the return value is.
2) Your get.active.mask.X is a generally useful construct, but I think
it can be represented via bitmath and a bitcast right? (i.e. does it
have to be an intrinsic?)
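To make the bitmath point concrete, here is a scalar model of the mask (my own sketch with a made-up name, not code from the patch; VF fixed at 4): lane i of the <4 x i1> result is active iff i < remaining, which is just the low min(remaining, 4) bits of an integer. In IR that could plausibly be expressed as shift/sub bitmath plus a bitcast to <4 x i1>, which is the question being raised.

```c
#include <stdint.h>

/* Hypothetical scalar model of @llvm.arm.get.active.mask.4:
   lane i is active iff i < elements remaining. The <4 x i1>
   result is equivalent to the low 4 bits of this integer,
   which is what suggests a bitmath + bitcast formulation. */
static uint8_t active_mask4(uint32_t remaining) {
    uint32_t n = remaining < 4 ? remaining : 4; /* clamp to the VF */
    return (uint8_t)((1u << n) - 1u);           /* n low bits set */
}
```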
3) There seems to be a good amount of overlap with the SVE ideas. I'm
not suggesting it needs to be reconciled, just pointing out many of the
issues are common. (The more I see discussion of these topics, the
more unsettled it all feels. Trying out a couple of experimental
designs, and iterating until one wins, is feeling more and more like the
right approach.)
On 5/20/19 4:00 AM, Sam Parker via llvm-dev wrote:
> Arm have recently announced the v8.1-M architecture specification for
> our next generation microcontrollers. The architecture includes
> vector extensions (MVE) and support for low-overhead branches (LoB),
> which can be thought of as a style of hardware loop. Hardware loops
> aren't new to LLVM, other backends (at least Hexagon and PPC that I
> know of) also include support. These implementations insert the loop
> controlling instructions at the MachineInstr level and I'd like to
> propose that we add intrinsics to support this notion at the IR
> level; primarily to be able to use scalar evolution to understand the
> loops instead of having to implement a machine-level analysis for
> each target.
> I've posted an RFC with a prototype implementation in
> https://reviews.llvm.org/D62132. It contains intrinsics that are
> currently Arm specific, but I hope they're general enough to be used
> by all targets. The Arm v8.1-m architecture supports do-while and
> while loops, but for conciseness, here, I'd like to just focus on
> while loops. There are two parts to this RFC: (1) the intrinsics
> and (2) a prototype implementation in the Arm backend to enable
> tail-predicated machine loops.
> 1. LLVM IR Intrinsics
> In the following definitions, I use the term 'element' to describe
> the work performed by an IR loop that has not been vectorized or
> unrolled by the compiler. This should be equivalent to the loop at
> the source level.
> void @llvm.arm.set.loop.iterations(i32)
> - Takes a single operand: the number of iterations to be executed.
> i32 @llvm.arm.set.loop.elements(i32, i32)
> - Takes two operands:
> - The total number of elements to be processed by the loop.
> - The maximum number of elements processed in one iteration of
> the IR loop body.
> - Returns the number of iterations to be executed.
> <X x i1> @llvm.arm.get.active.mask.X(i32)
> - Takes as an operand the number of elements that still need
> processing.
> - Where 'X' denotes the vectorization factor, returns a vector of i1
> indicating which vector lanes are active for the current loop
> iteration.
> i32 @llvm.arm.loop.end(i32, i32)
> - Takes two operands:
> - The number of elements that still need processing.
> - The maximum number of elements processed in one iteration of the
> IR loop body.
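As a reader's aid, here is a scalar C model of the two counting intrinsics as I understand the definitions above (my inference, not code from D62132; the ceil-division reading of set.loop.elements and the plain subtraction in loop.end are assumptions):

```c
#include <stdint.h>

/* Hypothetical model of @llvm.arm.set.loop.elements(N, VF):
   iterations needed to process N elements at up to VF elements
   per iteration, i.e. ceil(N / VF). Returning 0 when N == 0 is
   what lets the guarding `icmp ne` branch skip the loop. */
static uint32_t set_loop_elements(uint32_t n, uint32_t vf) {
    return (n + vf - 1) / vf;
}

/* Hypothetical model of @llvm.arm.loop.end(elts, VF): the element
   count left after this iteration. Its result feeds a signed
   `icmp sgt ..., 0`, so it may go negative on the final trip. */
static int32_t loop_end(int32_t elts, int32_t vf) {
    return elts - vf;
}
```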
> The following gives an illustration of their intended usage:
>
> entry:
>   %0 = call i32 @llvm.arm.set.loop.elements(i32 %N, i32 4)
>   %1 = icmp ne i32 %0, 0
>   br i1 %1, label %vector.ph, label %for.loopexit
>
> vector.ph:
>   br label %vector.body
>
> vector.body:
>   %elts = phi i32 [ %N, %vector.ph ], [ %elts.rem, %vector.body ]
>   %active = call <4 x i1> @llvm.arm.get.active.mask(i32 %elts, i32 4)
>   %load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x
>     i32>* %addr, i32 4, <4 x i1> %active, <4 x i32> undef)
>   tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %load, <4
>     x i32>* %addr.1, i32 4, <4 x i1> %active)
>   %elts.rem = call i32 @llvm.arm.loop.end(i32 %elts, i32 4)
>   %cmp = icmp sgt i32 %elts.rem, 0
>   br i1 %cmp, label %vector.body, label %for.loopexit
>
> for.loopexit:
>   ret void
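For orientation, a scalar C sketch of what the example loop computes (my reading of the IR, not code from the patch; the pointer stepping, which the snippet elides, is assumed):

```c
#include <stdint.h>

/* Scalar equivalent of the tail-predicated copy loop above: each
   iteration moves up to 4 elements, and the final partial trip is
   handled by the lane mask instead of a scalar epilogue loop. */
static void predicated_copy(const int32_t *src, int32_t *dst, uint32_t n) {
    for (int32_t elts = (int32_t)n; elts > 0; elts -= 4) {  /* loop.end */
        uint32_t lanes = elts < 4 ? (uint32_t)elts : 4u;    /* active mask */
        for (uint32_t i = 0; i < lanes; i++)                /* masked load/store */
            dst[i] = src[i];
        src += 4;
        dst += 4;
    }
}
```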
> As the example shows, control-flow is still ultimately performed
> through the icmp and br pair. There's nothing connecting the
> intrinsics to a given loop or any requirement that a set.loop.* call
> needs to be paired with a loop.end call.
> 2. Low-overhead loops in the Arm backend
> Disclaimer: The prototype is barebones, reuses parts of NEON, and
> I'm currently targeting the Cortex-A72, which does not support this
> feature! opt and llc build, and the provided test case doesn't cause
> a crash.
> The low-overhead branch extension can be combined with MVE to
> generate vectorized loops in which the epilogue is executed within
> the predicated vector body. The proposal is for this to be supported
> through a series of passes:
> 1) IR LoopPass to identify suitable loops and insert the intrinsics
> proposed above.
> 2) DAGToDAG ISel which maps the intrinsics, almost 1-1, to pseudo
> instructions.
> 3) A final MachineFunctionPass to expand the pseudo instructions.
> To help enable the lowering of an i1 vector, the VPR register has
> been added. This is a status register that contains the P0 predicate
> and is also used to model the implicit predicates of tail-predicated
> instructions.
> There are two main reasons why pseudo instructions are used instead
> of generating MIs directly during ISel:
> 1) They give us a chance to later inspect the whole loop and
> confirm that it's a good idea to generate such a loop. This is
> trivial for scalar loops, but not really applicable for
> tail-predicated loops.
> 2) It allows us to separate the decrementing of the loop counter from
> the instruction that branches back, which should help us recover if
> LR gets spilt between these two pseudo ops.
> For Armv8.1-M, the while.setup intrinsic is used to generate the wls
> and wlstp instructions, while loop.end generates the le and letp
> instructions. The active.mask can just be removed because the lane
> predication is handled implicitly.
> I'm not sure of the vectorizer's limitations on generating vector
> instructions that operate across lanes, such as reductions, when
> generating a predicated loop, but this needs to be considered.
> I'd welcome any feedback here or on Phabricator, and I'd especially
> like to know if this would be useful to current targets.
> Sam Parker
> Compilation Tools Engineer | Arm
> . . . . . . . . . . . . . . . . . . . . . . . . . . .