[llvm-dev] LV: predication

Mon May 4 23:21:31 PDT 2020

Hi Sjoerd,

thanks a lot for the clarification. Makes sense.

Kind regards,

On Tue, 5 May 2020 at 0:06, Sjoerd Meijer <Sjoerd.Meijer at arm.com>
wrote:

> Hi Roger,
>
> That's a good example that shows most of the moving parts involved here.
> In a nutshell, the difference, and what we would like to make explicit,
> is the vector trip count versus the scalar loop trip count. In your IR example,
> the loads/stores are predicated on a mask that is calculated from a splat
> induction variable, which is compared with the vector trip count.
> Illustrated with your example simplified, and with some pseudo-code, if we
> tail-fold and vectorize this scalar loop:
>
> for i= 0 to 10
>   a[i] = b[i] + c[i];
>
> the vector loop trip count is rounded up to 12, the next multiple of 4,
> and lanes are predicated on i < 10:
>
> for i= 0 to 12
>   a[i:4] = b[i:4] + c[i:4],    if i < 10;
>
> what we would like to generate is a vector loop with implicit predication,
> which works by setting up the number of elements processed by the loop:
>
> hwloop 10
>   a[i:4] = b[i:4] + c[i:4]
>
> This is implicit since instructions don't produce/consume a mask, but it
> is generated and used under the hood by the "hwloop" construct. Your
> observation that the information in the IR is mostly there is correct, but
> rather than pattern matching and reconstructing this in the backend, we
> would like to make this explicit. In this example, the scalar iteration
> count 10 is the number of elements processed by this loop, which is what
> we want to pass on from the vectoriser to backend passes.
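
To make the difference concrete, here is a minimal C sketch of the two
forms, assuming a vector width of 4 as in the pseudo-code above. The
function names are made up for illustration, and the inner lane loops only
model what a single vector instruction would do; this is not what the
vectoriser or an MVE backend actually emits.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 }; /* assumed vector width, matching a[i:4] in the pseudo-code */

/* Round the scalar trip count up to the next multiple of vf. */
size_t vector_trip_count(size_t n, size_t vf) {
    assert(vf > 0);
    return (n + vf - 1) / vf * vf; /* e.g. n = 10, vf = 4 gives 12 */
}

/* Tail-folded form: iterate over the rounded-up vector trip count and
   guard every lane with an explicit mask bit (i + lane < n). */
void add_explicit_mask(int *a, const int *b, const int *c, size_t n) {
    size_t vec_trip = vector_trip_count(n, VF);
    for (size_t i = 0; i < vec_trip; i += VF) {
        for (size_t lane = 0; lane < VF; ++lane) {
            int active = (i + lane) < n; /* the per-lane predicate */
            if (active)
                a[i + lane] = b[i + lane] + c[i + lane];
        }
    }
}

/* "hwloop"-style form: the loop is set up with the element count, and each
   iteration handles min(VF, remaining) elements. No mask value appears in
   the body; the predication is implied by the remaining element count. */
void add_element_count(int *a, const int *b, const int *c, size_t n) {
    size_t remaining = n;
    for (size_t i = 0; remaining > 0; i += VF) {
        size_t active = remaining < VF ? remaining : VF;
        for (size_t lane = 0; lane < active; ++lane)
            a[i + lane] = b[i + lane] + c[i + lane];
        remaining -= active;
    }
}
```

Both functions write the same 10 elements for n = 10; the only difference
is whether the loop is driven by the vector trip count plus a mask, or
directly by the number of elements.
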
>
> Hope this helps.
> Cheers,
> Sjoerd.
>
>
>
> ------------------------------
> *From:* Roger Ferrer Ibáñez <rofirrim at gmail.com>
> *Sent:* 04 May 2020 21:22
> *To:* Sjoerd Meijer <Sjoerd.Meijer at arm.com>
> *Cc:* Eli Friedman <efriedma at quicinc.com>; llvm-dev <
> llvm-dev at lists.llvm.org>; Sam Parker <Sam.Parker at arm.com>
> *Subject:* Re: [llvm-dev] LV: predication
>
> Hi Sjoerd,
>
>
> That would be an excellent way of doing it and it would also map very well
> to MVE too, where we have a VCTP intrinsic/instruction that creates the
> mask/predicate (Vector Create Tail-Predicate). So I will go for this
> approach. Such an intrinsic was actually also proposed in Sam's original
> RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html),
> but we hadn't implemented it yet. This intrinsic will probably look
> something like this:
>
>     <N x i1> @llvm.loop.get.active.mask(AnyInt, AnyInt)
>
> It produces a <N x i1> predicate based on its two arguments, the number of
> elements and the vector trip count, and it will be used by the predicated
> (masked) load/store instructions in the vector body. I will start drafting
> an implementation for this and continue with this in D79100.
>
>
> I'm curious about this, because this looks to me very similar to the code
> that -prefer-predicate-over-epilog is already emitting for the "outer mask"
> of a tail-folded loop.
>
> The following code
>
> void foo(int N, int *restrict c, int *restrict a, int *restrict b) {
> #pragma clang loop vectorize(enable) interleave(disable)
>   for (int i = 0; i < N; i++) {
>     a[i] = b[i] + c[i];
>   }
> }
>
> compiled with clang --target=x86_64 -mavx512f -mllvm
> -prefer-predicate-over-epilog -emit-llvm -O2 emits the following IR
>
> vector.body:                                      ; preds = %vector.body, %for.body.preheader.new
>   %index = phi i64 [ 0, %for.body.preheader.new ], [ %index.next.1, %vector.body ]
>   %niter = phi i64 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.1, %vector.body ]
>   %broadcast.splatinsert12 = insertelement <16 x i64> undef, i64 %index, i32 0
>   %broadcast.splat13 = shufflevector <16 x i64> %broadcast.splatinsert12, <16 x i64> undef, <16 x i32> zeroinitializer
>   %induction = or <16 x i64> %broadcast.splat13, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>
>   %4 = getelementptr inbounds i32, i32* %b, i64 %index
>   *%5 = icmp ule <16 x i64> %induction, %broadcast.splat*
>   ...
>   %wide.masked.load = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %6, i32 4, *<16 x i1> %5*, <16 x i32> undef), !tbaa !2
>
> I understand %5 is not the same as what your proposed
> llvm.loop.get.active.mask would compute, is that correct? Can you
> elaborate on the difference here?
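
For reference, the comparison that produces %5 can be sketched in C roughly
as below. This assumes that %broadcast.splat (not defined in the excerpt)
is the splat of the scalar trip count minus one; the function name and the
byte-per-lane mask representation are only illustrative.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 16 }; /* matches the <16 x i64> vectors in the IR */

/* Lane l is active when (index | l) <= last, mirroring
   icmp ule <16 x i64> %induction, %broadcast.splat.
   ASSUMPTION: `last` stands for the scalar trip count minus one. */
void outer_mask(uint64_t index, uint64_t last, uint8_t mask[LANES]) {
    /* The IR uses `or` instead of `add`; the two agree only because
       index is a multiple of the vector width. */
    assert(index % LANES == 0);
    for (uint64_t lane = 0; lane < LANES; ++lane)
        mask[lane] = (uint8_t)((index | lane) <= last);
}
```

With a scalar trip count of 20 (last = 19), the first vector iteration has
all 16 lanes active and the second one only lanes 0 to 3.
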
>
> Thanks a lot,
> Roger
>

-- 
Roger Ferrer Ibáñez