[PATCH] D75746: [LoopVectorizer] Simplify branch in the remainder loop for trivial cases

Tue Oct 6 07:00:15 PDT 2020

danilaml added a comment.

Rebased.
Had some time to come back to this.
I've tried implementing the following proposed way to get VTC locally:

  Step = VF*UF
  BTC = PSE::BackedgeTakenCount()
  R = BTC % Step
  VTC = BTC - R
  if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC)
  if (foldTail) VTC = VTC + Step
  VectorTripCount = VTC

But it had generally a negative performance impact on the benchmarks I've run mainly due to 1) more instructions required to compute VTC 2) `VTC = (R == Step-1 ? VTC + Step : VTC)` select for common case making vectorized loop's Exit Count uncomputable thus impeding some later optimizations.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3107
+  // If VFxUF is 2 and vector loop is not skipped then remainder executes once.
+  if (VF * UF == 2 && !areSafetyChecksAdded()) {
+    if (BasicBlock *Latch = OrigLoop->getLoopLatch())
----------------
Ayal wrote:
> danilaml wrote:
> > Ayal wrote:
> > > danilaml wrote:
> > > > Ayal wrote:
> > > > > danilaml wrote:
> > > > > > Ayal wrote:
> > > > > > > danilaml wrote:
> > > > > > > > I think this check is enough unless there are other cases in which "remainder loop has `N % (VF*UF)` iterations doesn't hold.
> > > > > > > Note that the trip count of the remainder loop may be **equal** to VF*UF, when loop `requiresScalarEpilogue()`; so in the above special case of VF*UF==2 remainder loop may iterate once **or twice**.
> > > > > > > 
> > > > > > > Note that `emitMinimumIterationCountCheck()` also takes care of the case where adding 1 to the backedge-taken count overflows, leading to an incorrect trip count of zero; here too "remainder" loop iterates (much) more than once.
> > > > > > Thanks. I knew I might've missed something. This makes me more skeptical about potential llvm.assume solution.
> > > > > > Am I understanding your note correctly, that adding requiresScalarEpilogue check is enough?
> > > > > Adding `requiresScalarEpilogue()` check is enough to handle the first case, but not the second case.
> > > > > One way to try and handle the second case is to change `getOrCreateVectorTripCount()` so that it relies on `PSE::getBackedgeTakenCount()` w/o adding 1 to it, as this addition (done by `getOrCreateTripCount()`) may overflow to zero.
> > > > > See r209854, and the `max_i32_backedgetaken()` test it added to test/Transforms/LoopVectorize/induction.ll.
> > > > > 
> > > > > Another way may be to check if/when adding 1 is known not to overflow.
> > > > I haven't figured out how to make VectorTripCount reliant solely on getBackedgeTakenCount without introducing extra checks/instructions (which seems to be the rationale behind the current code, instead of the one in r209854, at the expense of not vectorizing the rare UINT_MAX loops).
> > > > 
> > > > Instead, I've opted in checking whether the overflow might've occurred by using getConstantMaxBackedgeTakenCount. If I understood things correctly, it would return -1 (all ones) in the overflow case.
> > > A sketch of making VectorTripCount work correctly for loops with BTC= UINTMAX as well:
> > > 
> > > ```
> > >   Step = VF*UF
> > >   BTC = PSE::BackedgeTakenCount()
> > >   N = BTC + 1                     // could overflow to 0 so do not compute N % Step
> > >   if (foldTail) N = N + (Step-1)  // for rounding up
> > >   R = BTC % Step                  // Fits foldTail: (N+Step-1)%Step == (BTC+1+Step-1)%Step == (BTC+Step)%Step == BTC%Step
> > >   if !(foldTail) { R = R + 1      // Fits requiresScalarEpilog: produces 0 < R <= Step 
> > >     if !(requiresScalarEpilog) R = (R == Step ? 0 : R) == R % Step}
> > >   VectorTripCount = N - R        
> > > 
> > > ```
> > > which could be optimized into
> > > 
> > > ```
> > >   Step = VF*UF
> > >   BTC = PSE::BackedgeTakenCount()
> > >   R = BTC % Step
> > >   VTC = BTC - R
> > >   if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC)
> > >   if (foldTail) VTC = VTC + Step
> > >   VectorTripCount = VTC
> > > ```
> > > This also allows foldTail to work with Steps (i.e., UF's) that are not a power of 2.
> > Hm, but doesn't it introduce additional instructions/compares that might lead to worse CodeGen?
> > I.e. in the common case, BTC will need to be computed, when `BTC - R`, when `select` from `VTC` and `VTC+Step`.
> > whereas currently it is just `VectorTripCount = TC - (TC % Step)`, where TC is already computed.
> Hopefully the code-size increase and the slowdown will be insignificant, being outside the loop, but indeed they need to be checked.
> 
> Another option is to use the fact that, although BTC+1 might overflow and wrap to zero, BTC-(Step-1) **may not underflow** if !foldTail thanks to the min.iters.check of TripCount = N = BTC+1 >= Step, and therefore (BTC+1)(w/o overflow) === (BTC-(Step-1)) modulo Step.
> So for a common case of !foldTail && !requiresScalarEpilog, VectorTripCount can be computed w/o risk of overflow or underflow using `N-((N-Step)%Step)` instead of the current `N - (N % Step)`.
> 
> For completeness, current VectorTripCount is computed by:
> 
> ```
>   Step = VF*UF
>   BTC = PSE::BackedgeTakenCount()
>   N = BTC + 1
>   if (foldTail) N = N + (Step-1)
>   R = N % Step
>   if (requiresScalarEpilog) R = (R == 0 ? Step : R)
>   VectorTripCount = N - R 
> ```
> 
I'm not sure what using `N-((N-Step)%Step)` instead of `N - (N % Step)` is supposed to solve.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75746/new/

https://reviews.llvm.org/D75746