[llvm-dev] LV: predication

Sjoerd Meijer via llvm-dev llvm-dev at lists.llvm.org
Mon May 4 15:05:56 PDT 2020


Hi Roger,

That's a good example that shows most of the moving parts involved here. In a nutshell, the difference, and what we would like to make explicit, is the vector trip count versus the scalar loop trip count. In your IR example, the loads/stores are predicated on a mask that is calculated from a splat induction variable, which is compared with the vector trip count. Illustrated with your example simplified, and with some pseudo-code: if we tail-fold and vectorize this scalar loop:

for i = 0 to 10
  a[i] = b[i] + c[i];

the loop's upper bound is rounded up to 12, the next multiple of 4, and lanes are predicated on i < 10:

for i = 0 to 12
  a[i:4] = b[i:4] + c[i:4],    if i < 10;

what we would like to generate is a vector loop with implicit predication, which works by setting up the number of elements processed by the loop:

hwloop 10
  a[i:4] = b[i:4] + c[i:4]

This is implicit since the instructions don't produce/consume a mask; it is generated and used under the hood by the "hwloop" construct. Your observation that the information in the IR is mostly there is correct, but rather than pattern matching and reconstructing this in the backend, we would like to make this explicit. In this example, the scalar iteration count 10 is the number of elements processed by this loop, which is what we want to pass on from the vectoriser to backend passes.
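For reference, this is roughly the pattern a backend currently has to recognise to recover that information; it is a simplified, VF=4 sketch of the kind of IR in your example, and the value names are illustrative only:

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  ; broadcast the induction variable and add the lane numbers, giving each
  ; lane its scalar iteration index
  %splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %splat = shufflevector <4 x i64> %splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %splat, <i64 0, i64 1, i64 2, i64 3>
  ; %bound.splat holds the loop bound broadcast to all lanes
  ; (the role %broadcast.splat plays in your IR)
  %mask = icmp ule <4 x i64> %induction, %bound.splat
  %load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %ptr, i32 4, <4 x i1> %mask, <4 x i32> undef)
  ...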

Hope this helps.
Cheers,
Sjoerd.



________________________________
From: Roger Ferrer Ibáñez <rofirrim at gmail.com>
Sent: 04 May 2020 21:22
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Eli Friedman <efriedma at quicinc.com>; llvm-dev <llvm-dev at lists.llvm.org>; Sam Parker <Sam.Parker at arm.com>
Subject: Re: [llvm-dev] LV: predication

Hi Sjoerd,


That would be an excellent way of doing it, and it would also map very well to MVE, where we have a VCTP intrinsic/instruction (Vector Create Tail-Predicate) that creates the mask/predicate. So I will go for this approach. Such an intrinsic was actually also proposed in Sam's original RFC (see https://lists.llvm.org/pipermail/llvm-dev/2019-May/132512.html), but we hadn't implemented it yet. The intrinsic will probably look something like this:

    <N x i1> @llvm.loop.get.active.mask(AnyInt, AnyInt)

It produces an <N x i1> predicate based on its two arguments, the number of elements and the vector trip count, and it will be used by the masked load/store instructions in the vector body. I will start drafting an implementation for this and continue with it in D79100.
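To make that concrete, here is a rough sketch of how a tail-folded vector body could use the intrinsic; the operand names and the exact semantics are placeholders that still need to be settled in D79100:

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  ; hypothetical: produce the predicate for the lanes that fall within the
  ; number of elements processed by the loop
  %active = call <4 x i1> @llvm.loop.get.active.mask(i64 %elements, i64 %vector.trip.count)
  %gep = getelementptr inbounds i32, i32* %b, i64 %index
  %cast = bitcast i32* %gep to <4 x i32>*
  %wide.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %cast, i32 4, <4 x i1> %active, <4 x i32> undef)
  ...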

I'm curious about this, because this looks to me very similar to the code that -prefer-predicate-over-epilog is already emitting for the "outer mask" of a tail-folded loop.

The following code

void foo(int N, int *restrict c, int *restrict a, int *restrict b) {
#pragma clang loop vectorize(enable) interleave(disable)
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
  }
}

compiled with clang --target=x86_64 -mavx512f -mllvm -prefer-predicate-over-epilog -emit-llvm -O2 emits the following IR

vector.body:                                      ; preds = %vector.body, %for.body.preheader.new
  %index = phi i64 [ 0, %for.body.preheader.new ], [ %index.next.1, %vector.body ]
  %niter = phi i64 [ %unroll_iter, %for.body.preheader.new ], [ %niter.nsub.1, %vector.body ]
  %broadcast.splatinsert12 = insertelement <16 x i64> undef, i64 %index, i32 0
  %broadcast.splat13 = shufflevector <16 x i64> %broadcast.splatinsert12, <16 x i64> undef, <16 x i32> zeroinitializer
  %induction = or <16 x i64> %broadcast.splat13, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>
  %4 = getelementptr inbounds i32, i32* %b, i64 %index
  %5 = icmp ule <16 x i64> %induction, %broadcast.splat
  ...
  %wide.masked.load = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* %6, i32 4, <16 x i1> %5, <16 x i32> undef), !tbaa !2

I understand %5 is not the same as what your proposed llvm.loop.get.active.mask would compute, is that correct? Can you elaborate on the difference here?

Thanks a lot,
Roger