[PATCH] D50480: [LV] Vectorizing loops of arbitrary trip count without remainder under opt for size

Wed Aug 15 13:52:13 PDT 2018

hsaito added a comment.

In https://reviews.llvm.org/D50480#1201125, @reames wrote:

> We obviously don't get this guarantee and thus there's a legality question here the vectorizer would have to solve.  There are two obvious approaches: speculation safety and predication.  Unless I'm misreading this patch, it has the same problem and uses predication right?

In this particular case, there isn't much speculation involved. If you count computing the loop index beyond the original upper bound (and using it in a compare) as speculation, then it is, but we know there are no safety issues. In your case, what really matters is what is inside "unlikely(i > M)". If that is just a trivial "i > M" (or something that can be converted to that form), we are better off simply changing the loop upper bound, and doing so before reaching the vectorizer. Then this patch will take care of it. If not (i.e., a general compute_some_predicate_value_based_on(i)), the whole speculation safety issue comes up. That is the difficult part to deal with, and this patch doesn't deal with any aspect of it.
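
A minimal sketch of the trivial case, in C. The function names, the loop body, and the assumption that M is loop-invariant and below INT_MAX are made up for illustration; none of this is taken from the patch. When the early exit is a plain "i > M", the break can be folded into the upper bound, which leaves a single-exit counted loop:

```c
#include <assert.h>
#include <limits.h>

/* Original form: counted loop with a trivial early exit on the index. */
long sum_with_break(const int *a, int n, int m) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i > m)      /* loop-invariant bound, no memory access involved */
            break;
        s += a[i];
    }
    return s;
}

/* Rewritten form: the exit is folded into the upper bound min(n, m + 1). */
long sum_clamped(const int *a, int n, int m) {
    assert(m < INT_MAX);                /* assumption: m + 1 does not overflow */
    int ub = (m + 1 < n) ? m + 1 : n;   /* new upper bound */
    long s = 0;
    for (int i = 0; i < ub; i++)        /* single exit, plain counted loop */
        s += a[i];
    return s;
}
```

Both functions visit exactly the same iterations, so no question of speculation arises in the second form.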

>> There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code
>>  do the unfinished work and then break. ICC's simd early_exit implements the latter.
> 
> Just to confirm, this is only needed if there's a use of a variable from within the loop down the early exit path right?  If there's not, then we don't need to distinguish which iteration "caused" the exit.  This is actually an interesting and useful subcase for me.

I don't know what you mean by "a use of a variable from within the loop down the early exit path". Assume cond becomes true within a vector chunk (say, at elem #2). You then have to execute B for all prior iterations (i.e., elem #0 and #1)
and execute A for elem #2.

  for (i){
     if (cond){
         A
         break;
     }
     B
  }

Assuming that B is lexically below all the early exit points (note: this is vectorization, so you need some lexical ordering assumption somewhere), B can be executed non-speculatively under proper predication.
This kind of predication, however, has nothing to do with this patch. General IF-THEN-ELSE and GOTO-based control flow needs the same kind of predication.
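
A minimal sketch of that predication, emulating one vector chunk with scalar C. VF, the function names, and the concrete cond/A/B (cond is "a[i] > 0", A returns the index, B stores 2 * a[i]) are hypothetical choices for illustration, and the trip count is assumed to be a multiple of VF so that reading every lane of a[] in a chunk is known to be safe. That last assumption is exactly the speculation question, which this sketch sidesteps rather than solves:

```c
#include <assert.h>

#define VF 4  /* hypothetical vector width */

/* Scalar reference: B runs for every iteration before the exit, A for the exiting one. */
int scalar_early_exit(const int *a, int *out, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] > 0)          /* cond */
            return i;          /* A, then break */
        out[i] = 2 * a[i];     /* B */
    }
    return -1;
}

/* Chunked emulation: the "vector" code finishes the partial chunk itself. */
int chunked_early_exit(const int *a, int *out, int n) {
    assert(n >= 0 && n % VF == 0);  /* assumption: every lane of a[] is readable */
    for (int i = 0; i < n; i += VF) {
        /* Evaluate cond on all lanes and find the first lane that exits. */
        int first = VF;
        for (int l = VF - 1; l >= 0; l--)
            if (a[i + l] > 0)
                first = l;
        /* B under the mask "lane is before the first exiting lane". */
        for (int l = 0; l < VF; l++)
            if (l < first)
                out[i + l] = 2 * a[i + l];
        /* A for the exiting lane only. */
        if (first < VF)
            return i + first;
    }
    return -1;
}
```

Lanes at or after the exiting one never execute B, so the observable stores match the scalar loop.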

>> Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of exit condition often involves speculation, and compiler somehow needs to ensure such speculation is safe (or let the programmer assert like ICC's "simd early_exit"). Simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].
> 
> Unless I'm missing something, this is a restatement of the above right?

Sure, but unless you are talking about trivial (i.e., not very interesting) "early exit" cases, how to deal with speculation is the most important aspect of the vectorizer's early exit handling.

> Other examples are things like i < M for loop invariant M.  Provided we can compute all values of i in the next vector iteration without faulting (usually doable), we can do the vector check to form our predicate.

Sure, but that's not very interesting from a vectorization perspective. The vectorizer doesn't have to do what other loop transformations can handle.

> I am very specifically not interested in the language extension aspects.  I'm specifically asking about doing the transform for unannotated C code.  (i.e. having to prove all the legality the hard way)

ICC is doing it. So, if anyone volunteers before we get to it, let us know so that we can share what we have learned. It's an important aspect of vectorization, but it is not yet high enough on our priority list, so we aren't jumping on it immediately.

>> If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream come true: smaller code, faster compile, and faster execution. Looking forward to hearing such great news.
> 
> I take it you don't see AVX512 as qualifying?

Qualifying as what?

If your question is whether ICC uses the masked main vector code for AVX512, other than OptForSize case, then the answer is yes it does.

It's a combination of HW and SW. If you know the trip count as a compile-time constant, you can evaluate various ways to vectorize and pick the best one, which works much better than when you don't know the trip count. The legacy part of LV isn't set up to do such an evaluation. The VPlan native part of LV would eventually have such a capability. Without this capability, we need to go one way or the other rather blindly, and blindly changing the status quo requires a pretty good justification (like a brilliantly fast masked vector execution unit). I'm more interested in doing the evaluation when the VPlan native path is ready to do that.

>   Not surprised, but I'd be curious to hear your reasoning.  You might be coming at this from a different angle than I am  

If the trip count is unknown, the best AVX512 vectorization strategy so far is to go with an unmasked (at the top level) vector main loop. The underlying assumption is that the unmasked vector main loop is faster than the masked one, and that a lot of time is spent executing the main vector loop. If that assumption does not hold, for example because the main vector code isn't executed much, programmers should try to communicate a trip count estimate to the compiler so that the compiler can do a better job. As the HW narrows the gap between the two, the optimization point moves. We have to evaluate every generation of HW and see what works best. So, my comment applies to today's HW. I don't know what the ARM SVE folks would say about their HW.
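
A minimal sketch of the two shapes being compared, emulated in scalar C. VF, the function names, and the loop body are hypothetical, and the inner lane loops only stand in for vector instructions; this is not how LV or ICC generates code:

```c
#include <assert.h>

#define VF 4  /* hypothetical vector width */

/* Strategy 1: unmasked main loop over full chunks, then a scalar remainder loop. */
void add_one_main_plus_remainder(int *a, int n) {
    assert(n >= 0);
    int i = 0;
    for (; i + VF <= n; i += VF)        /* full chunks, no mask needed */
        for (int l = 0; l < VF; l++)
            a[i + l] += 1;
    for (; i < n; i++)                  /* remainder iterations */
        a[i] += 1;
}

/* Strategy 2: one masked loop, no remainder; a lane is active only if its index is below n. */
void add_one_masked(int *a, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i += VF)     /* trip count rounded up to a multiple of VF */
        for (int l = 0; l < VF; l++)
            if (i + l < n)              /* per-lane mask */
                a[i + l] += 1;
}
```

The second form has no remainder loop, which is the code size win under opt for size, but every iteration of its main loop pays for the mask.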

Does this make sense to you?

Repository:
  rL LLVM

https://reviews.llvm.org/D50480