[PATCH] D50480: [LV] Vectorizing loops of arbitrary trip count without remainder under opt for size

Thu Aug 16 12:37:49 PDT 2018

hsaito added inline comments.

================
Comment at: include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h:485
   SmallPtrSet<const Instruction *, 8> MaskedOp;
+
+  /// All blocks of loop are to be masked to fold tail of scalar iterations.
----------------
I think it's best not to keep this state in the Legal. From the Legal perspective, being able to vectorize the whole loop body under the mask and the actual decision to do so are completely separate issues. 

Since canFold...() is invoked by CostModel::computeMaxVF, we should be able to keep this state in the CostModel. After all, whether to bail out or continue under FoldTailByMasking is a cost model side of the state, after consulting the Legal.

================
Comment at: lib/Transforms/Vectorize/LoopVectorizationLegality.cpp:792

 bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) {
+  return (FoldTailByMasking ||
----------------
By moving FoldTail state to CostModel,  we can define CostModel::blockNeedsPredication(BB) as FoldTailByMasking || LAI::blockNeedsPredication(BB) and make Legal version static to Legal.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2673
   // unroll factor (number of SIMD instructions).
-  Constant *Step = ConstantInt::get(TC->getType(), VF * UF);
   Value *R = Builder.CreateURem(TC, Step, "n.mod.vf");

----------------
Ayal wrote:
> hsaito wrote:
> > Ayal wrote:
> > > hsaito wrote:
> > > > This Urem creation should be skipped if we aren't generating remainder.
> > > This Urem is also used to round N up to a multiple of Step, i.e., when we're not generating remainder.
> > Ouch. Well, given the assertion for VF*UF being power of two (constant), the UREM and other computation should be reasonably optimizable downstream. So, it's probably unfair to ask you to fix the trip count computation ---- so, I won't ask. There is a trade off between generating more optimal output IR and the cost of maintaining the code to do that. Keeping UREM here is opting for lower maintenance. Just for the record.
> Rounding N down to a multiple of Step is in general N-(N%Step). If Step is a constant multiple of two (which is currently always the case, and must be the case when folding the tail by masking), it gets optimized downstream to N&(-Step). If Step would be some other constant it may get optimized downstream to use multiplication instead of division, depending on target characteristics. In any case, this takes place before the loop; and is orthogonal to this patch, which simply reuses the existing logic to also round up.
>orthogonal to this patch

I agree.

Repository:
  rL LLVM

https://reviews.llvm.org/D50480