# [llvm-dev] Vectorizing multiple exit loops

Philip Reames via llvm-dev llvm-dev at lists.llvm.org
Mon Sep 9 10:53:36 PDT 2019

```I've recently mentioned in a few places that I'm interested in enhancing
the loop vectorizer to handle multiple exit loops, and have been asked
to share plans.  This email is intended to a) share my current thinking
and b) help spark discussion among interested parties.  I do need to
warn that my near term plans for this have been delayed; I got pulled

*Background*

At the moment, our loop vectorizer does not handle any loop with more
than one exit, or where that sole exit is not from the latch block.
Interestingly, it does handle internal predication within a single loop
iteration.  This results in a slightly odd behavior where a loop which
can be written with either a continue or a break can exhibit wildly
different performance depending on that choice.  It does hint at a
possible direction for implementation though, and implies that most of
the cost modeling pieces are already in place.

The use cases I'm looking at basically fall into two buckets:

for (int i = 0; i < N; i++) {
if (expr(a[i])) break;
... other vectorizable stuff ...
}

for (int i = 0; i < N; i++) {
if (expr(i)) break;
... other vectorizable stuff ...
}

The former are the actually "interesting" examples.  The later are cases
where we missed eliminating some check we could have, but
not-vectorizing creates an unfortunate performance cliff.

*Framing*

We have three important sub-cases to handle.

First, there are all the cases where we could have handled the multiple
exit loop, but chose not to for implementation reasons. A great example is:

for (int i = 0; i < N; i++) {
if (i > M) break;
a[i] = i;
}

In this case, SCEV can happily compute the actual exit bound of the
loop, and we could use the existing vector-body/scalar slow-path
structure.  The only change needed would be to exit the vector body
earlier.  (There are some details here, but it's almost as easy as I'm
describing if my POC patch isn't missing something major.)

There's a couple other cases like this.  I suspect we can get decent
mileage out of just generalizing the existing code.

Second, there are the cases where we actually have to handle iterations
within the vector-body with predication (or speculation).  The good news
is that there's already support in code for predication, we just need to
add another source of potential predication.  The bad news is that's a
fairly major change.

Our challenge will be finding a runtime check for dereferenceability.
Consider the following example:
for (int i = 0; i < N; i++)
if (a[i] == 0) return false;
return true;

If 'a' is an alloca, or statically sized global, we can form a runtime
check which ensures 'i' is in bounds and a speculative load will not fault.

Here's a nice example we could handle with this approach.

// Assume a and b are both statically sized.
for (int i = 0; i < N; i++) {
t = b[i];
if (t > M) throw();
sum += a[t];
}

(This is a classic a[b[i]] pattern, but with range checks added.)

This is broadly applicable enough to be useful, and in practice covers
my use cases, but I'm hoping others have ideas for extensions here.

Before resorting to that though, we have the potential to rely more on
speculation safety reasoning.  I have a patch out for review currently
(D66688 [LoopVectorize] Leverage speculation safety to avoid
prototyping in this area; it benefits existing predication logic, so I
separated it.

The other major challenge here is profitability.  Consider a simple loop
such as:

// assume a[0:N] is known dereferenceable
for (int i = 0; i < N; i++)
if (a[i] == 0) return false;
return true;

If N is large, and the array is non-zero, then this is profitable to
vectorize.  If a[0] == 0, then it isn't, regardless of the value of N.

Figuring out when to vectorize vs not for cases like this will require
some thought.  I don't really have a good answer for this yet, other
than when the profile on the early exit tells us it's rarely taken.

Third, for both of the former cases, we need to be able to compute exit
values along the early exit edges.  We can get a lot of mileage out of
simply skipping loops with exit values (i.e. lcssa phis), as our loop
exit value rewriting will tend to eliminate them.  However, we will
eventually want to handle this case, which will require generating some
interesting complicated code down the exit path to figure out which
iteration actually exit.

I see two general options here:

1) Use the vector-body/scalar-body idiom of today, and have the vector
body exit with IV = I when any iteration in [I, I+VF-1] would exit.
(i.e. roll back)

2) Insert dedicated exit blocks which recompute exit conditions to
determine exact exit value, and then let the vector body run all
iterations in VF which contain the exiting iteration.  (Assumes
predicated stores, and that the exit blocks skip the scalar loop)

I currently don't have a reason to strongly prefer one over the other.
(2) is conceptually cleaner and the one which keeps coming up in
conversation, but (1) may be simpler to actually implement.

*Current Plans*

At the current moment, I'm reasonable sure that I'm going to get the
resources to at least tackle some of the cases where we bail out
unnecessarily.  This will be a huge practical improvement in vectorizing
robustness, at (I hope) relatively little implementation cost.

I'm also going to continue playing around with enhancements to our
current dereferenceability logic.  I see that as being a key building
block to make any predication based approach practical.

I'm not sure I'm going to get to the predication support.  I'd like to,
but am not sure my resource constraints allow it.  I'll also mention
that I'm not at all sure how all of this might fit in with the VPLAN
work.  I'd really welcome feedback on that; is what I'm proposing at all
consistent with others plans?

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190909/1da080f1/attachment.html>
```