[PATCH] D75746: [LoopVectorizer] Simplify branch in the remainder loop for trivial cases

Mon Apr 6 08:38:45 PDT 2020

danilaml marked an inline comment as done.
danilaml added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3107
+  // If VFxUF is 2 and vector loop is not skipped then remainder executes once.
+  if (VF * UF == 2 && !areSafetyChecksAdded()) {
+    if (BasicBlock *Latch = OrigLoop->getLoopLatch())
----------------
Ayal wrote:
> danilaml wrote:
> > Ayal wrote:
> > > danilaml wrote:
> > > > Ayal wrote:
> > > > > danilaml wrote:
> > > > > > I think this check is enough unless there are other cases in which "remainder loop has `N % (VF*UF)` iterations doesn't hold.
> > > > > Note that the trip count of the remainder loop may be **equal** to VF*UF, when loop `requiresScalarEpilogue()`; so in the above special case of VF*UF==2 remainder loop may iterate once **or twice**.
> > > > > 
> > > > > Note that `emitMinimumIterationCountCheck()` also takes care of the case where adding 1 to the backedge-taken count overflows, leading to an incorrect trip count of zero; here too "remainder" loop iterates (much) more than once.
> > > > Thanks. I knew I might've missed something. This makes me more skeptical about potential llvm.assume solution.
> > > > Am I understanding your note correctly, that adding requiresScalarEpilogue check is enough?
> > > Adding `requiresScalarEpilogue()` check is enough to handle the first case, but not the second case.
> > > One way to try and handle the second case is to change `getOrCreateVectorTripCount()` so that it relies on `PSE::getBackedgeTakenCount()` w/o adding 1 to it, as this addition (done by `getOrCreateTripCount()`) may overflow to zero.
> > > See r209854, and the `max_i32_backedgetaken()` test it added to test/Transforms/LoopVectorize/induction.ll.
> > > 
> > > Another way may be to check if/when adding 1 is known not to overflow.
> > I haven't figured out how to make VectorTripCount reliant solely on getBackedgeTakenCount without introducing extra checks/instructions (which seems to be the rationale behind the current code, instead of the one in r209854, at the expense of not vectorizing the rare UINT_MAX loops).
> > 
> > Instead, I've opted in checking whether the overflow might've occurred by using getConstantMaxBackedgeTakenCount. If I understood things correctly, it would return -1 (all ones) in the overflow case.
> A sketch of making VectorTripCount work correctly for loops with BTC= UINTMAX as well:
> 
> ```
>   Step = VF*UF
>   BTC = PSE::BackedgeTakenCount()
>   N = BTC + 1                     // could overflow to 0 so do not compute N % Step
>   if (foldTail) N = N + (Step-1)  // for rounding up
>   R = BTC % Step                  // Fits foldTail: (N+Step-1)%Step == (BTC+1+Step-1)%Step == (BTC+Step)%Step == BTC%Step
>   if !(foldTail) { R = R + 1      // Fits requiresScalarEpilog: produces 0 < R <= Step 
>     if !(requiresScalarEpilog) R = (R == Step ? 0 : R) == R % Step}
>   VectorTripCount = N - R        
> 
> ```
> which could be optimized into
> 
> ```
>   Step = VF*UF
>   BTC = PSE::BackedgeTakenCount()
>   R = BTC % Step
>   VTC = BTC - R
>   if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC)
>   if (foldTail) VTC = VTC + Step
>   VectorTripCount = VTC
> ```
> This also allows foldTail to work with Steps (i.e., UF's) that are not a power of 2.
Hm, but doesn't it introduce additional instructions/compares that might lead to worse CodeGen?
I.e. in the common case, BTC will need to be computed, when `BTC - R`, when `select` from `VTC` and `VTC+Step`.
whereas currently it is just `VectorTripCount = TC - (TC % Step)`, where TC is already computed.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75746/new/

https://reviews.llvm.org/D75746