[llvm-dev] Tail-Loop Folding/Predication

Mon Jul 15 08:42:37 PDT 2019

By "folded into the main loop", do you actually mean replace the main loop?
So, in effect, running the entire loop under predicate so there is only one
loop body?
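
To make sure we are talking about the same thing, here is how I read the two
loop shapes, as a minimal scalar C sketch. The vector width VF, the function
names, and the simple copy kernel are all made up for illustration; the inner
"lane" loops only stand in for a single vector operation and are not what the
vectorizer actually emits.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* assumed vector width in lanes; illustrative only */

/* Conventional shape: full-width main loop plus a scalar tail loop. */
void copy_main_plus_tail(int *dst, const int *src, size_t n) {
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        for (size_t lane = 0; lane < VF; ++lane)  /* stands in for one vector op */
            dst[i + lane] = src[i + lane];
    for (; i < n; ++i)                            /* remainder (tail) iterations */
        dst[i] = src[i];
}

/* Folded shape: a single loop body, every iteration runs under a lane mask. */
void copy_predicated(int *dst, const int *src, size_t n) {
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i += VF) {
        size_t remaining = n - i;                 /* elements still to process */
        for (size_t lane = 0; lane < VF; ++lane) {
            int active = lane < remaining;        /* per-lane predicate */
            if (active)
                dst[i + lane] = src[i + lane];    /* inactive lanes touch nothing */
        }
    }
}
```

In the second form there is no tail left at all: the mask is all-true for every
iteration except possibly the last, which is why I read this as predicating the
whole loop rather than doing something to the tail.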

If so, I think that will be a useful pragma in general, but in my opinion
the name is not appropriate, since it has nothing to do with the tail other
than how this is accomplished at the moment. Is your thinking that the front
end would generate the mask calculation, or are you just leveraging the
existing fold-tail-by-masking support and removing the original vectorized
loop body?

I think the proper implementation should really be to generate the
predicated instructions in the first place (I'd also like to see actual
predicates on the instructions instead of selects, but that is another
thread), so #pragma loop vectorize(enable) predicated(enable) (or
something like that) seems a better choice. This would also allow you to
disable running a loop under predicate if, in the future, the cost model in
LLVM (or downstream) thinks it's best to generate this type of loop and
performance numbers suggest otherwise.

On Mon, Jul 15, 2019 at 9:46 AM Sjoerd Meijer via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> I am looking for feedback to add support for a new loop pragma to
> Clang/LLVM.
> With "#pragma tail_predicate" the idea would be to indicate that a loop
> epilogue/tail can, or should be, folded into the main loop. I see two use
> cases for this pragma.
>
> First, this could be interesting for the vectorizer. It currently supports
> tail folding by masking all loop instructions/blocks, but does this only
> when optimising for size is enabled. This pragma could override the
> cost-model/opt-level.
>
> Second use case would be the Armv8.1-M MVE vector extension, which supports
> tail-predicated hardware loops. This version of hardware loops sets the
> vector lanes to be masked, and is thus a nice optimisation that avoids
> generating a tail loop when the number of elements processed is not a
> multiple of the vector length.
>
> For this use case, the tail predicate pragma could be a good user
> experience improvement, as it would for example allow this more compact
> form without any predicated intrinsics:
>
>   #pragma tail_predicate
>   do {
>     VLD(..);   // some vector load intrinsic
>     VST(..);   // some vector store intrinsic
>     ..
>   } while (N);
>
> which can then be transformed and predication made explicit through data
> dependencies like so:
>
>   do {
>     mask = vctp(N);   // intrinsic that generates the mask of active lanes
>     VLD(.., mask);
>     VST(.., mask);
>     ..
>   } while (N);
>
> A vector loop in this form can easily be picked up by the new hardware loop
> pass, and the corresponding tail-predicated hardware loop can be generated.
> This is only a small example, but we think the benefit could be substantial
> for more complicated examples.
>
> I have uploaded a patch for the initial Clang plumbing exercise here:
> https://reviews.llvm.org/D64744
>
> Cheers,
> Sjoerd.