[PATCH] D50480: [LV] Vectorizing loops of arbitrary trip count without remainder under opt for size

Wed Aug 15 11:44:28 PDT 2018

reames added a comment.

In https://reviews.llvm.org/D50480#1200014, @hsaito wrote:

> In https://reviews.llvm.org/D50480#1199900, @reames wrote:
>
> > I have a general question about direction, not specific to this patch.
> >
> > It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms.  What are our plans in terms of supporting more general predication?  For instance, I don't believe we handle loops like the following at the moment:
> >   for (int i = 0; i < N; i++) {
> >     if (unlikely(i > M))
> >       break;
> >     sum += a[i];
> >   }
> >
> > Can the infrastructure in this patch be generalized to handle such cases?  And if so, are there any specific plans to do so?
>
>
> Short answer is No.
>
> From the vectorizer's perspective, the mechanics are quite different.

Ok, I think we're talking past each other a bit.  I see these both as forms of predication.  It sounds like you have a slightly different view; I'll try to ask clarifying questions in the right spots.  I think we have different mental models here and I'm trying to understand where that difference is.

> In the Intel compiler (ICC) 18.0, we implemented "#pragma omp simd early_exit" to handle this situation in a somewhat more general manner. Hopefully, the syntax will be standardized in the future and more compilers will implement it.

I'm unfamiliar with this pragma, but the best reference I found was https://software.intel.com/en-us/fortran-compiler-18.0-developer-guide-and-reference-simd-directive-openmp-api

From what I can tell, this provides user guarantees of a couple of legality checks and profitability checks.  I don't know enough about OpenMP to completely follow all the wording, but the key bit appears to be this:
"Each operation before the last lexical early exit of the loop may be executed as if the early exit were not triggered within the SIMD chunk."

We obviously don't get this guarantee, and thus there's a legality question here that the vectorizer would have to solve.  There are two obvious approaches: speculation safety and predication.  Unless I'm misreading this patch, it has the same problem and uses predication, right?

> There are two ways to think. 1) If the vector condition is not all false (i.e., break is taken for some element), take the break and let scalar code do the unfinished work. 2) If the vector condition is not all false (i.e., break is taken for some element), let vector code do the unfinished work and then break. ICC's simd early_exit implements the latter.

Just to confirm, this is only needed if there's a use of a variable from within the loop down the early exit path, right?  If there's not, then we don't need to distinguish which iteration "caused" the exit.  This is actually an interesting and useful subcase for me.
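
To make the subcase concrete, here is a minimal sketch of a loop whose early exit path uses nothing computed inside the loop.  The names (contains_scalar, contains_chunked, VF) are made up for illustration, and the "vector" loop is emulated lane by lane in plain C.  It assumes a[0..n) is known to be dereferenceable, so reading lanes past the exiting element is safe speculation; the vectorizer would have to prove that for real code.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 }; /* hypothetical vector width, chosen only for illustration */

/* Scalar original: the exit path returns a constant, so it does not
 * matter which iteration triggered the exit. */
static int contains_scalar(const int *a, size_t n, int key) {
    assert(n == 0 || a != NULL);
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            return 1;
    return 0;
}

/* Lane-by-lane emulation of a vectorized form. Each chunk OR-reduces the
 * per-lane compare results; if any lane hit, we exit without working out
 * which lane it was. Assumes all of a[0..n) may be read. */
static int contains_chunked(const int *a, size_t n, int key) {
    assert(n == 0 || a != NULL);
    for (size_t i = 0; i < n; i += VF) {
        int any_hit = 0;
        for (size_t lane = 0; lane < VF; lane++) {
            size_t idx = i + lane;
            if (idx < n) /* tail predicate: never read past a[n-1] */
                any_hit |= (a[idx] == key);
        }
        if (any_hit)
            return 1;
    }
    return 0;
}
```

If the exit path instead used i or a partial sum, the chunked version would additionally need to find the first exiting lane and mask off everything after it.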

> Either way, it's best not to think along the lines of this (rather simple) patch. Please note that even the determination of the exit condition often involves speculation, and the compiler somehow needs to ensure such speculation is safe (or let the programmer assert it, like ICC's "simd early_exit"). A simple "if (A[i]>0) break", for example, involves speculation in the vector load of A[i].

Unless I'm missing something, this is a restatement of the above, right?

I agree that cases like a[i] > 0 are the hard ones.  Other examples are things like i < M for loop-invariant M.  Provided we can compute all values of i in the next vector iteration without faulting (usually doable), we can do the vector check to form our predicate.
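
As a rough sketch of what I mean, here is the loop from my earlier comment with the exit condition turned into a per-lane predicate.  The function names and VF are made up for illustration, and the vector iteration is emulated lane by lane in plain C; this is not what the patch or the vectorizer emits.  The point is that the predicate depends only on the index and the loop-invariant bounds, so it can be computed for the whole chunk without touching memory, and a[] is only read under the predicate.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 }; /* hypothetical vector width, chosen only for illustration */

/* Scalar original from earlier in the thread. */
static int sum_scalar(const int *a, int n, int m) {
    assert(n <= 0 || a != NULL);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (i > m)
            break;
        sum += a[i];
    }
    return sum;
}

/* Lane-by-lane emulation of a predicated vector loop. The lane predicate
 * uses only idx, n and m, so forming it cannot fault. Wide indices avoid
 * overflow when n is close to INT_MAX. */
static int sum_predicated(const int *a, int n, int m) {
    assert(n <= 0 || a != NULL);
    int sum = 0;
    for (long long i = 0; i < n; i += VF) {
        int any_exit = 0;
        for (int lane = 0; lane < VF; lane++) {
            long long idx = i + lane;
            int exits = idx >= n || idx > m; /* tail or early exit */
            if (exits)
                any_exit = 1;
            else
                sum += a[idx]; /* masked load: active lanes only */
        }
        if (any_exit)
            break; /* some lane left the loop; no scalar remainder needed */
    }
    return sum;
}
```

Because i > M is monotonic in i, every lane after the first exiting lane is inactive automatically.  A condition like a[i] > 0 has no such property, so the mask would have to be restricted to the lanes before the first exiting one, and the load feeding the condition would itself need to be proven safe.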

> From our perspective, bringing OpenMP4.5 functionality to LLVM is higher priority than bringing early_exit extension. If anyone wants to work on simd early_exit in LLVM, we are more than happy to share our learning. Please let us know.

I am very specifically not interested in the language extension aspects.  I'm specifically asking about doing the transform for unannotated C code.  (i.e. having to prove all the legality the hard way)

> > Secondly, are there any plans to enable this approach for anything other than optsize?
>
> If someone has a brilliantly fast masked vector execution unit, that would be a possibility. As a vectorizer person, that would be a dream come true: smaller code, faster compile, and faster execution. Looking forward to hearing such great news.

I take it you don't see AVX512 as qualifying?  Not surprised, but I'd be curious to hear your reasoning.  You might be coming at this from a different angle than I am.

Repository:
  rL LLVM

https://reviews.llvm.org/D50480