[PATCH] D75746: [LoopVectorizer] Simplify branch in the remainder loop for trivial cases

Ayal Zaks via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 6 13:04:46 PDT 2020


Ayal added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3107
+  // If VFxUF is 2 and vector loop is not skipped then remainder executes once.
+  if (VF * UF == 2 && !areSafetyChecksAdded()) {
+    if (BasicBlock *Latch = OrigLoop->getLoopLatch())
----------------
danilaml wrote:
> Ayal wrote:
> > danilaml wrote:
> > > Ayal wrote:
> > > > danilaml wrote:
> > > > > Ayal wrote:
> > > > > > danilaml wrote:
> > > > > > > I think this check is enough unless there are other cases in which "remainder loop has `N % (VF*UF)` iterations doesn't hold.
> > > > > > Note that the trip count of the remainder loop may be **equal** to VF*UF, when loop `requiresScalarEpilogue()`; so in the above special case of VF*UF==2 remainder loop may iterate once **or twice**.
> > > > > > 
> > > > > > Note that `emitMinimumIterationCountCheck()` also takes care of the case where adding 1 to the backedge-taken count overflows, leading to an incorrect trip count of zero; here too "remainder" loop iterates (much) more than once.
> > > > > Thanks. I knew I might've missed something. This makes me more skeptical about potential llvm.assume solution.
> > > > > Am I understanding your note correctly, that adding requiresScalarEpilogue check is enough?
> > > > Adding `requiresScalarEpilogue()` check is enough to handle the first case, but not the second case.
> > > > One way to try and handle the second case is to change `getOrCreateVectorTripCount()` so that it relies on `PSE::getBackedgeTakenCount()` w/o adding 1 to it, as this addition (done by `getOrCreateTripCount()`) may overflow to zero.
> > > > See r209854, and the `max_i32_backedgetaken()` test it added to test/Transforms/LoopVectorize/induction.ll.
> > > > 
> > > > Another way may be to check if/when adding 1 is known not to overflow.
> > > I haven't figured out how to make VectorTripCount reliant solely on getBackedgeTakenCount without introducing extra checks/instructions (which seems to be the rationale behind the current code, instead of the one in r209854, at the expense of not vectorizing the rare UINT_MAX loops).
> > > 
> > > Instead, I've opted in checking whether the overflow might've occurred by using getConstantMaxBackedgeTakenCount. If I understood things correctly, it would return -1 (all ones) in the overflow case.
> > A sketch of making VectorTripCount work correctly for loops with BTC= UINTMAX as well:
> > 
> > ```
> >   Step = VF*UF
> >   BTC = PSE::BackedgeTakenCount()
> >   N = BTC + 1                     // could overflow to 0 so do not compute N % Step
> >   if (foldTail) N = N + (Step-1)  // for rounding up
> >   R = BTC % Step                  // Fits foldTail: (N+Step-1)%Step == (BTC+1+Step-1)%Step == (BTC+Step)%Step == BTC%Step
> >   if !(foldTail) { R = R + 1      // Fits requiresScalarEpilog: produces 0 < R <= Step 
> >     if !(requiresScalarEpilog) R = (R == Step ? 0 : R) == R % Step}
> >   VectorTripCount = N - R        
> > 
> > ```
> > which could be optimized into
> > 
> > ```
> >   Step = VF*UF
> >   BTC = PSE::BackedgeTakenCount()
> >   R = BTC % Step
> >   VTC = BTC - R
> >   if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC)
> >   if (foldTail) VTC = VTC + Step
> >   VectorTripCount = VTC
> > ```
> > This also allows foldTail to work with Steps (i.e., UF's) that are not a power of 2.
> Hm, but doesn't it introduce additional instructions/compares that might lead to worse CodeGen?
> I.e. in the common case, BTC will need to be computed, when `BTC - R`, when `select` from `VTC` and `VTC+Step`.
> whereas currently it is just `VectorTripCount = TC - (TC % Step)`, where TC is already computed.
Hopefully the code-size increase and the slowdown will be insignificant, being outside the loop, but indeed they need to be checked.

Another option is to use the fact that, although BTC+1 might overflow and wrap to zero, BTC-(Step-1) **may not underflow** if !foldTail thanks to the min.iters.check of TripCount = N = BTC+1 >= Step, and therefore (BTC+1)(w/o overflow) === (BTC-(Step-1)) modulo Step.
So for a common case of !foldTail && !requiresScalarEpilog, VectorTripCount can be computed w/o risk of overflow or underflow using `N-((N-Step)%Step)` instead of the current `N - (N % Step)`.

For completeness, current VectorTripCount is computed by:

```
  Step = VF*UF
  BTC = PSE::BackedgeTakenCount()
  N = BTC + 1
  if (foldTail) N = N + (Step-1)
  R = N % Step
  if (requiresScalarEpilog) R = (R == 0 ? Step : R)
  VectorTripCount = N - R 
```



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75746/new/

https://reviews.llvm.org/D75746





More information about the llvm-commits mailing list