[PATCH] D75746: [LoopVectorizer] Simplify branch in the remainder loop for trivial cases

Fri Apr 3 07:29:59 PDT 2020

Ayal added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3107
+  // If VFxUF is 2 and vector loop is not skipped then remainder executes once.
+  if (VF * UF == 2 && !areSafetyChecksAdded()) {
+    if (BasicBlock *Latch = OrigLoop->getLoopLatch())
----------------
danilaml wrote:
> Ayal wrote:
> > danilaml wrote:
> > > Ayal wrote:
> > > > danilaml wrote:
> > > > > I think this check is enough unless there are other cases in which "remainder loop has `N % (VF*UF)` iterations doesn't hold.
> > > > Note that the trip count of the remainder loop may be **equal** to VF*UF, when loop `requiresScalarEpilogue()`; so in the above special case of VF*UF==2 remainder loop may iterate once **or twice**.
> > > > 
> > > > Note that `emitMinimumIterationCountCheck()` also takes care of the case where adding 1 to the backedge-taken count overflows, leading to an incorrect trip count of zero; here too "remainder" loop iterates (much) more than once.
> > > Thanks. I knew I might've missed something. This makes me more skeptical about potential llvm.assume solution.
> > > Am I understanding your note correctly, that adding requiresScalarEpilogue check is enough?
> > Adding `requiresScalarEpilogue()` check is enough to handle the first case, but not the second case.
> > One way to try and handle the second case is to change `getOrCreateVectorTripCount()` so that it relies on `PSE::getBackedgeTakenCount()` w/o adding 1 to it, as this addition (done by `getOrCreateTripCount()`) may overflow to zero.
> > See r209854, and the `max_i32_backedgetaken()` test it added to test/Transforms/LoopVectorize/induction.ll.
> > 
> > Another way may be to check if/when adding 1 is known not to overflow.
> I haven't figured out how to make VectorTripCount reliant solely on getBackedgeTakenCount without introducing extra checks/instructions (which seems to be the rationale behind the current code, instead of the one in r209854, at the expense of not vectorizing the rare UINT_MAX loops).
> 
> Instead, I've opted in checking whether the overflow might've occurred by using getConstantMaxBackedgeTakenCount. If I understood things correctly, it would return -1 (all ones) in the overflow case.
A sketch of making VectorTripCount work correctly for loops with BTC= UINTMAX as well:

```
  Step = VF*UF
  BTC = PSE::BackedgeTakenCount()
  N = BTC + 1                     // could overflow to 0 so do not compute N % Step
  if (foldTail) N = N + (Step-1)  // for rounding up
  R = BTC % Step                  // Fits foldTail: (N+Step-1)%Step == (BTC+1+Step-1)%Step == (BTC+Step)%Step == BTC%Step
  if !(foldTail) { R = R + 1      // Fits requiresScalarEpilog: produces 0 < R <= Step 
    if !(requiresScalarEpilog) R = (R == Step ? 0 : R) == R % Step}
  VectorTripCount = N - R        

```
which could be optimized into

```
  Step = VF*UF
  BTC = PSE::BackedgeTakenCount()
  R = BTC % Step
  VTC = BTC - R
  if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC)
  if (foldTail) VTC = VTC + Step
  VectorTripCount = VTC
```
This also allows foldTail to work with Steps (i.e., UF's) that are not a power of 2.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75746/new/

https://reviews.llvm.org/D75746