[PATCH] D27690: [LV] Don't vectorize when we have a static bound on trip count

Tue Dec 13 10:58:54 PST 2016

mkuper added a comment.

In https://reviews.llvm.org/D27690#620834, @hfinkel wrote:

> Can you please remind me why we're doing this? A loop with a small trip count should be unrolled if it is small, so we should really only be seeing here loops with small trip counts that have large bodies. That being the case, if vectorization looks profitable, why isn't it?

There are two separate reasons it may not be profitable:

1. The one-time setup overhead for the vectorized loop - broadcasts, bounds computations for runtime checks, etc. is too high w.r.t the gain from vectorization.
2. We should never use a VF that is higher than the trip-count, since if we do, we're only paying the price for the overhead, while in practice we'll only execute the scalar remainder loop.  This is, of course, completely independent of the size of the loop body. (That may change if we start using masked vector remainder loops, but I think Intel experimented with that, and didn't get very good results, so I don't see this happening in the near future.)

For (1), what we should do, if we want to be precise, is compute the overhead cost, and then take both the cost and the expected number of iterations into account as part of the cost model. As to (2), the right way to deal with it would be to upper-bound the VF choice by the trip-count.

However, since we don't really model the overhead correctly what we have (and have had since forever) is this big hammer of just bailing out on loops with a small trip count. What I'm doing is extending this to uniformly cover all kinds of loops - that is, not only loops with a known exact static trip count, but also dynamic trip counts (for PGO builds) and static trip count bounds.

My main motivations for that are:

a) Loops that have an unknown static trip-count, but according to the profile have very low iteration counts (0 or 1). These should not be unrolled, and should never be vectorized either.

b) Remainder loops for hand-vectorized code. These will also not be unrolled - the trip-count is unknown, and doesn't have a known multiple. (We may end up with runtime unrolling and yet another "remainder loop", which doesn't really improve things.) And, of course, it's almost always a bad idea to vectorize these. (The exception may be something like hand-vectorization by 16, with a scalar remainder loop. We may want to vectorize that remainder by 4 and leave a smaller scalar remainder, but that sounds like a very small win.)

I agree that we probably want to fix (1) eventually, but that will still need a good trip count estimate, so it's in some sense orthogonal to improving the estimate in cases where it's not statically known exactly.

https://reviews.llvm.org/D27690