[PATCH] D75746: [LoopVectorizer] Simplify branch in the remainder loop for trivial cases

Wed Jun 9 14:30:10 PDT 2021

Ayal added a comment.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3107
+  // If VFxUF is 2 and vector loop is not skipped then remainder executes once.
+  if (VF * UF == 2 && !areSafetyChecksAdded()) {
+    if (BasicBlock *Latch = OrigLoop->getLoopLatch())
----------------
danilaml wrote:
> Ayal wrote:
> > danilaml wrote:
> > > Ayal wrote:
> > > > danilaml wrote:
> > > > > Ayal wrote:
> > > > > > danilaml wrote:
> > > > > > > Ayal wrote:
> > > > > > > > danilaml wrote:
> > > > > > > > > I think this check is enough unless there are other cases in which "remainder loop has `N % (VF*UF)` iterations doesn't hold.
> > > > > > > > Note that the trip count of the remainder loop may be **equal** to VF*UF, when loop `requiresScalarEpilogue()`; so in the above special case of VF*UF==2 remainder loop may iterate once **or twice**.
> > > > > > > > 
> > > > > > > > Note that `emitMinimumIterationCountCheck()` also takes care of the case where adding 1 to the backedge-taken count overflows, leading to an incorrect trip count of zero; here too "remainder" loop iterates (much) more than once.
> > > > > > > Thanks. I knew I might've missed something. This makes me more skeptical about potential llvm.assume solution.
> > > > > > > Am I understanding your note correctly, that adding requiresScalarEpilogue check is enough?
> > > > > > Adding `requiresScalarEpilogue()` check is enough to handle the first case, but not the second case.
> > > > > > One way to try and handle the second case is to change `getOrCreateVectorTripCount()` so that it relies on `PSE::getBackedgeTakenCount()` w/o adding 1 to it, as this addition (done by `getOrCreateTripCount()`) may overflow to zero.
> > > > > > See r209854, and the `max_i32_backedgetaken()` test it added to test/Transforms/LoopVectorize/induction.ll.
> > > > > > 
> > > > > > Another way may be to check if/when adding 1 is known not to overflow.
> > > > > I haven't figured out how to make VectorTripCount reliant solely on getBackedgeTakenCount without introducing extra checks/instructions (which seems to be the rationale behind the current code, instead of the one in r209854, at the expense of not vectorizing the rare UINT_MAX loops).
> > > > > 
> > > > > Instead, I've opted in checking whether the overflow might've occurred by using getConstantMaxBackedgeTakenCount. If I understood things correctly, it would return -1 (all ones) in the overflow case.
> > > > A sketch of making VectorTripCount work correctly for loops with BTC= UINTMAX as well:
> > > > 
> > > > ```
> > > >   Step = VF*UF
> > > >   BTC = PSE::BackedgeTakenCount()
> > > >   N = BTC + 1                     // could overflow to 0 so do not compute N % Step
> > > >   if (foldTail) N = N + (Step-1)  // for rounding up
> > > >   R = BTC % Step                  // Fits foldTail: (N+Step-1)%Step == (BTC+1+Step-1)%Step == (BTC+Step)%Step == BTC%Step
> > > >   if !(foldTail) { R = R + 1      // Fits requiresScalarEpilog: produces 0 < R <= Step 
> > > >     if !(requiresScalarEpilog) R = (R == Step ? 0 : R) == R % Step}
> > > >   VectorTripCount = N - R        
> > > > 
> > > > ```
> > > > which could be optimized into
> > > > 
> > > > ```
> > > >   Step = VF*UF
> > > >   BTC = PSE::BackedgeTakenCount()
> > > >   R = BTC % Step
> > > >   VTC = BTC - R
> > > >   if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC)
> > > >   if (foldTail) VTC = VTC + Step
> > > >   VectorTripCount = VTC
> > > > ```
> > > > This also allows foldTail to work with Steps (i.e., UF's) that are not a power of 2.
> > > Hm, but doesn't it introduce additional instructions/compares that might lead to worse CodeGen?
> > > I.e. in the common case, BTC will need to be computed, when `BTC - R`, when `select` from `VTC` and `VTC+Step`.
> > > whereas currently it is just `VectorTripCount = TC - (TC % Step)`, where TC is already computed.
> > Hopefully the code-size increase and the slowdown will be insignificant, being outside the loop, but indeed they need to be checked.
> > 
> > Another option is to use the fact that, although BTC+1 might overflow and wrap to zero, BTC-(Step-1) **may not underflow** if !foldTail thanks to the min.iters.check of TripCount = N = BTC+1 >= Step, and therefore (BTC+1)(w/o overflow) === (BTC-(Step-1)) modulo Step.
> > So for a common case of !foldTail && !requiresScalarEpilog, VectorTripCount can be computed w/o risk of overflow or underflow using `N-((N-Step)%Step)` instead of the current `N - (N % Step)`.
> > 
> > For completeness, current VectorTripCount is computed by:
> > 
> > ```
> >   Step = VF*UF
> >   BTC = PSE::BackedgeTakenCount()
> >   N = BTC + 1
> >   if (foldTail) N = N + (Step-1)
> >   R = N % Step
> >   if (requiresScalarEpilog) R = (R == 0 ? Step : R)
> >   VectorTripCount = N - R 
> > ```
> > 
> I'm not sure what using `N-((N-Step)%Step)` instead of `N - (N % Step)` is supposed to solve.
> I'm not sure what using `N-((N-Step)%Step)` instead of `N - (N % Step)` is supposed to solve.

Currently if BTC = UINTMAX the vector loop is bypassed and the scalar "remainder" loop executes all iterations, under !foldTail.
In order for loops requiring no runtime checks besides min.iters.check to execute the vector loop (as much as possible) also for BTC = UINTMAX, thereby ensuring that the scalar loop always executes at most Step-1 iterations -- //the original motivation for this patch// -- the following may be done:

(1) Change min.iters.check to check if `BTC >= Step-1` instead of checking if `N = BTC+1 >= Step`. The latter overflows for BTC = UINTMAX thereby bypassing the vector loop, whereas the former does not wrap.
(2) Compute the VectorTripCount using `N-((N-Step)%Step)` instead of `N - (N % Step)`. The latter produces zero when BTC = UINTMAX for any Step, which is incorrect for Steps that do not divide BTC+1, i.e., for non-power-of-2 UFs.

With requiresScalarEpilog, apply the `R = (R == 0 ? Step : R)` to `R = ((N-Step)%Step)` before subtracting it from N.

Note that VectorTripCount computed in (2) may overflow to zero, i.e., for BTC = UINTMAX and Step(UF) that is a power-of-2. This works correctly, as currently done with foldTail, where min.iters.check is eliminated, and UF is required to be a power-of-2. With foldTail, use `R = ((N-1)%Step) = BTC%Step` as suggested earlier above, which never wraps. In any case, this patch focuses on the tail.

To summarize:

```
Step = VF*UF
BTC = PSE::BackedgeTakenCount()
; min.iters.check:
if (!foldTail): if (BTC < Step-1) goto scalar loop
N = BTC+1   ; may overflow to zero
if (foldTail): R = BTC % Step
if (!foldTail): R = (N-Step) % Step
if (requiresScalarEpilog): R = (R == 0 ? Step : R)
VectorTripCount = N - R
```

Hopefully this way of ensuring that the tail scalar loop always executes less than Step iterations, also for non-power-of-2 Steps, has no significant negative performance impact.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75746/new/

https://reviews.llvm.org/D75746