[llvm-dev] LV: predication

Mon May 4 13:22:36 PDT 2020

Hi Sjoerd,

> That would be an excellent way of doing it and it would also map very well
> to MVE too, where we have a VCTP intrinsic/instruction that creates the
> mask/predicate (Vector Create Tail-Predicate). So I will go for this
> approach. Such an intrinsic was actually also proposed in Sam's original
> RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html),
> but we hadn't implemented it yet. This intrinsic will probably look
> something like this:
>
>     <N x i1> @llvm.loop.get.active.mask(AnyInt, AnyInt)
>
> It produces a <N x i1> predicate based on its two arguments, the number of
> elements and the vector trip count, and it will be used by the predicated
> masked loads/stores instructions in the vector body. I will start drafting
> an implementation for this and continue with this in D79100.
>

I'm curious about this, because this looks to me very similar to the code
that -prefer-predicate-over-epilog is already emitting for the "outer mask"
of a tail-folded loop.

The following code

void foo(int N, int *restrict c, int *restrict a, int *restrict b) {
#pragma clang loop vectorize(enable) interleave(disable)
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }
}

compiled with clang --target=x86_64 -mavx512f -mllvm
-prefer-predicate-over-epilog -emit-llvm -O2 emits the following IR

vector.body:                                      ; preds = %vector.body, %for.body.preheader.new
  %index = phi i64 [ 0, %for.body.preheader.new ], [ %index.next.1, %vector.body ]
  %niter = phi i64 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.1, %vector.body ]
  %broadcast.splatinsert12 = insertelement <16 x i64> undef, i64 %index, i32 0
  %broadcast.splat13 = shufflevector <16 x i64> %broadcast.splatinsert12, <16 x i64> undef, <16 x i32> zeroinitializer
  %induction = or <16 x i64> %broadcast.splat13, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>
  %4 = getelementptr inbounds i32, i32* %b, i64 %index
  %5 = icmp ule <16 x i64> %induction, %broadcast.splat
  ...
  %wide.masked.load = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %6, i32 4, <16 x i1> %5, <16 x i32> undef), !tbaa !2

I understand that %5 is not the same as what your proposed
llvm.loop.get.active.mask would compute; is that correct? Can you elaborate
on the difference here?
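
To make the question concrete, here is a rough scalar sketch in C of what I
read the icmp ule that defines %5 as computing, lane by lane. It rests on an
assumption: the excerpt does not show how %broadcast.splat is defined, and I
am taking it to be a splat of the backedge-taken count (N - 1). The names
tail_fold_mask, btc and VF are made up for illustration and do not come from
the vectorizer.

```c
#include <stdint.h>

#define VF 16 /* vectorization factor, matching the <16 x ...> types in the IR */

/* Hypothetical helper: scalar model of
 *   icmp ule <16 x i64> %induction, %broadcast.splat
 * where lane i of %induction is index + i.
 * ASSUMPTION: btc is the backedge-taken count (N - 1); the IR excerpt
 * does not show the definition of %broadcast.splat. */
static uint16_t tail_fold_mask(uint64_t index, uint64_t btc) {
    uint16_t mask = 0;
    for (unsigned lane = 0; lane < VF; lane++) {
        /* A lane is active while its element index is within the trip count. */
        if (index + lane <= btc)
            mask |= (uint16_t)(1u << lane);
    }
    return mask;
}
```

Under that assumption, for N = 20 the first vector iteration (index 0) gets
all 16 lanes active, and the second (index 16) gets only lanes 0 to 3 active.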

Thanks a lot,
Roger