[PATCH] D79976: [LV] Handle Fold-Tail of loops with vectorization factor (VF) equal to 1

Sun May 17 09:02:46 PDT 2020

Ayal added inline comments.

================
Comment at: llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1-scalar.ll:17
+; CHECK:         [[INDEX_NEXT]] = add i64 [[INDEX]], 4
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
+; CHECK-NEXT:    br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
----------------
anhtuyen wrote:
> anhtuyen wrote:
> > fhahn wrote:
> > > bmahjour wrote:
> > > > anhtuyen wrote:
> > > > > Ayal wrote:
> > > > > > bmahjour wrote:
> > > > > > > How is it that the original loop executes 15 iterations, but the vector loop iterates 16? It seems the minimum iteration count check branch at the top should branch to the scalar loop instead of vector.ph.
> > > > > > (Thanks for asking; it reminded me to check above and below that fold-tail emits the desired scalar `icmp ule`'s, which are the focus of this patch.)
> > > > > > 
> > > > > > Fold-tail is responsible for rounding-up the trip count from 15 to 16, see https://reviews.llvm.org/D50480.
> > > > > > Regarding minimum iteration count check branch, fold-tail is also responsible in general for branching directly to vector.ph w/o an "if (trip-count < VF*UF)", which in this case is known to be false anyhow.
> > > > > Ayal @Ayal, thank you very much for helping to clarify. My guess is that Bardia @bmahjour was concerned that the effective addition of a 16th iteration might affect the correctness of the generated code. I have neither evidence nor counterevidence to address his concern, so if you can shed some light on it when you have time, that would be great.
> > > > > 
> > > > > Back to this patch: since the value of the trip-count is not its main focus, I have taken the liberty of omitting that value from the checks. If that is not acceptable to anyone here, please let me know.
> > > > > 
> > > > I guess I would understand how the rounding-up would work, if the instructions in the body were somehow predicated, but I don't see any predication in the output IR. Is that because there are no instructions in this test case with side-effects, read/writes, etc?
> > > yes, this would probably need some actual instructions in the loop body, so there is something to predicate. @anhtuyen could you add a small vectorizable body, e.g. just storing to ptr+induction?
> > I will add it later tonight.
> Hi Florian @fhahn, this is the test case, modified along the lines you suggested. It has a trip count of 15. However, I cannot make it go through Fold-Tail.
> 
> ```
> define void @foo(double* %paramP) {
> entry:
>   %localP = getelementptr inbounds double, double* %paramP, i64 15
>   br label %for.body
> 
> for.cond.cleanup:
>   ret void
> 
> for.body:
>   %addr = phi double* [ %indP, %for.body ], [ %paramP, %entry ]
>   %indP = getelementptr inbounds double, double* %addr, i64 1
>   %cond = icmp eq double* %indP, %localP
>   store double 3.14, double* %indP              ;<======== I added this line
>   br i1 %cond, label %for.cond.cleanup, label %for.body
> }
> ```
> Do you have any hints I could use?
> However, I cannot make it go through Fold-Tail.

That is due to a bug in Fold-Tail, namely PR45679, which should be fixed by D80085. Specifically, using only type `double` on the default target leads to an internally computed MaxVF=1.
The test added in D80085, pr45679-fold-tail-by-masking.ll, should be extended by this patch to also handle the VF=1, UF=4 case.
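
To illustrate why rounding the trip count up from 15 to 16 is harmless once the body is predicated, here is a rough Python model of a tail-folded loop with VF=1, UF=4. It is only a sketch, not the vectorizer's code: the name `fold_tail_stores` is hypothetical, and the per-part guard `iv <= trip_count - 1` is assumed to correspond to the scalar `icmp ule` against the backedge-taken count mentioned earlier in this thread.

```python
def fold_tail_stores(trip_count, vf=1, uf=4):
    """Model a tail-folded loop (hypothetical helper, not LLVM code).

    Returns (executed, rounded, iterations): the induction values whose
    body actually ran, the rounded-up trip count, and the number of
    iterations taken by the widened loop.
    """
    assert trip_count > 0, "model assumes a non-zero trip count"
    step = vf * uf
    # Fold-tail rounds the trip count up to a multiple of VF*UF (15 -> 16).
    rounded = (trip_count + step - 1) // step * step
    # Backedge-taken count of the original loop (assumed mask bound).
    btc = trip_count - 1

    executed = []
    index = 0
    iterations = 0
    while True:
        for part in range(step):
            iv = index + part
            # Analogue of the scalar `icmp ule`: skip the body past the tail.
            if iv <= btc:
                executed.append(iv)  # stands in for the predicated store
        index += step
        iterations += 1
        # Analogue of `icmp eq i64 [[INDEX_NEXT]], 16` in the CHECK lines.
        if index == rounded:
            break
    return executed, rounded, iterations


executed, rounded, iterations = fold_tail_stores(15)
```

In this model the loop counts up to 16, yet only induction values 0 through 14 reach the body; the 16th slot is masked off. That is also why a body with no stores or other side effects shows nothing to predicate.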

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79976/new/

https://reviews.llvm.org/D79976