[llvm-dev] Vectorizing multiple exit loops

Mon Sep 9 10:53:36 PDT 2019

I've recently mentioned in a few places that I'm interested in enhancing 
the loop vectorizer to handle multiple exit loops, and have been asked 
to share plans.  This email is intended to a) share my current thinking 
and b) help spark discussion among interested parties.  I do need to 
warn that my near term plans for this have been delayed; I got pulled 
into an internal project instead.

*Background*

At the moment, our loop vectorizer does not handle any loop with more 
than one exit, or where that sole exit is not from the latch block.  
Interestingly, it does handle internal predication within a single loop 
iteration.  This results in a slightly odd behavior where a loop which 
can be written with either a continue or a break can exhibit wildly 
different performance depending on that choice.  It does hint at a 
possible direction for implementation though, and implies that most of 
the cost modeling pieces are already in place.

The use cases I'm looking at basically fall into two buckets:

for (int i = 0; i < N; i++) {
   if (expr(a[i])) break;
   ... other vectorizable stuff ...
}

for (int i = 0; i < N; i++) {
   if (expr(i)) break;
   ... other vectorizable stuff ...
}

The former are the actually "interesting" examples.  The later are cases 
where we missed eliminating some check we could have, but 
not-vectorizing creates an unfortunate performance cliff.

*Framing*

We have three important sub-cases to handle.

First, there are all the cases where we could have handled the multiple 
exit loop, but chose not to for implementation reasons. A great example is:

for (int i = 0; i < N; i++) {
   if (i > M) break;
   a[i] = i;
}

In this case, SCEV can happily compute the actual exit bound of the 
loop, and we could use the existing vector-body/scalar slow-path 
structure.  The only change needed would be to exit the vector body 
earlier.  (There are some details here, but it's almost as easy as I'm 
describing if my POC patch isn't missing something major.)

There's a couple other cases like this.  I suspect we can get decent 
mileage out of just generalizing the existing code.

Second, there are the cases where we actually have to handle iterations 
within the vector-body with predication (or speculation).  The good news 
is that there's already support in code for predication, we just need to 
add another source of potential predication.  The bad news is that's a 
fairly major change.

Our challenge will be finding a runtime check for dereferenceability.  
Consider the following example:
for (int i = 0; i < N; i++)
   if (a[i] == 0) return false;
return true;

If 'a' is an alloca, or statically sized global, we can form a runtime 
check which ensures 'i' is in bounds and a speculative load will not fault.

Here's a nice example we could handle with this approach.

// Assume a and b are both statically sized.
for (int i = 0; i < N; i++) {
   t = b[i];
   if (t > M) throw();
   sum += a[t];
}

(This is a classic a[b[i]] pattern, but with range checks added.)

This is broadly applicable enough to be useful, and in practice covers 
my use cases, but I'm hoping others have ideas for extensions here.

Before resorting to that though, we have the potential to rely more on 
speculation safety reasoning.  I have a patch out for review currently 
(D66688 [LoopVectorize] Leverage speculation safety to avoid 
masked.loads <https://reviews.llvm.org/D66688>) which fell out of some 
prototyping in this area; it benefits existing predication logic, so I 
separated it.

The other major challenge here is profitability.  Consider a simple loop 
such as:

// assume a[0:N] is known dereferenceable
for (int i = 0; i < N; i++)
   if (a[i] == 0) return false;
return true;

If N is large, and the array is non-zero, then this is profitable to 
vectorize.  If a[0] == 0, then it isn't, regardless of the value of N.

Figuring out when to vectorize vs not for cases like this will require 
some thought.  I don't really have a good answer for this yet, other 
than when the profile on the early exit tells us it's rarely taken.

Third, for both of the former cases, we need to be able to compute exit 
values along the early exit edges.  We can get a lot of mileage out of 
simply skipping loops with exit values (i.e. lcssa phis), as our loop 
exit value rewriting will tend to eliminate them.  However, we will 
eventually want to handle this case, which will require generating some 
interesting complicated code down the exit path to figure out which 
iteration actually exit.

I see two general options here:

1) Use the vector-body/scalar-body idiom of today, and have the vector 
body exit with IV = I when any iteration in [I, I+VF-1] would exit.  
(i.e. roll back)

2) Insert dedicated exit blocks which recompute exit conditions to 
determine exact exit value, and then let the vector body run all 
iterations in VF which contain the exiting iteration.  (Assumes 
predicated stores, and that the exit blocks skip the scalar loop)

I currently don't have a reason to strongly prefer one over the other.  
(2) is conceptually cleaner and the one which keeps coming up in 
conversation, but (1) may be simpler to actually implement.

*Current Plans*

At the current moment, I'm reasonable sure that I'm going to get the 
resources to at least tackle some of the cases where we bail out 
unnecessarily.  This will be a huge practical improvement in vectorizing 
robustness, at (I hope) relatively little implementation cost.

I'm also going to continue playing around with enhancements to our 
current dereferenceability logic.  I see that as being a key building 
block to make any predication based approach practical.

I'm not sure I'm going to get to the predication support.  I'd like to, 
but am not sure my resource constraints allow it.  I'll also mention 
that I'm not at all sure how all of this might fit in with the VPLAN 
work.  I'd really welcome feedback on that; is what I'm proposing at all 
consistent with others plans?

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190909/1da080f1/attachment.html>