[PATCH] D50480: [LV] Vectorizing loops of arbitrary trip count without remainder under opt for size

Tue Aug 14 18:33:07 PDT 2018

dcaballe added inline comments.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:6355
+  if (Legal->blockNeedsPredication(OrigLoop->getHeader()))
+    CM.InterleaveInfo.reset();
+
----------------
Ayal wrote:
> dcaballe wrote:
> > Just curious. Could we prevent the computation or interleave groups for these cases instead of doing a reset?
> That would have been simpler indeed. But there's a subtle phase-ordering issue here: `MaxVF=computeFeasibleMaxVF()` uses tentative interleave groups to `getSmallestAndWidestTypes()`, and is then used in determining if the tail should be folded by masking (i.e., if TC is a multiple of MaxVF), in which case these groups will all be masked/invalid.
Thanks!

================
Comment at: lib/Transforms/Vectorize/VPlan.h:609
   /// VPlan opcodes, extending LLVM IR with idiomatics instructions.
-  enum { Not = Instruction::OtherOpsEnd + 1 };
+  enum { Not = Instruction::OtherOpsEnd + 1, ICmpULE };

----------------
hsaito wrote:
> Ayal wrote:
> > dcaballe wrote:
> > > I'm worried that this new opcode could be problematic since now we can have compare instructions represented as VPInstructions with Instruction::ICmp and Instruction::FCmp opcodes and VPInstructions with VPInstruction::ICmpULE. Internally, we have a VPCmpInst subclass to model I/FCmp opcodes and their predicates. Do you think it would be better to upstream that subclass first? 
> > An alternative of leveraging `Instruction::ICmp` opcode and existing `ICmpInst` subclasses for keeping the Predicate, in a scalable way, could be (devised jointly w/ Gil):
> > 
> > ```
> > +    // Introduce the early-exit compare IV <= BTC to form header block mask.
> > +    // This is used instead of IV < TC because TC may wrap, unlike BTC.
> > +    VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
> > +    VPValue *BTC = Plan->getBackedgeTakenCount();
> > +    Value *Undef = UndefValue::get(Legal->getPrimaryInduction()->getType());
> > +    auto *ICmp = new ICmpInst(ICmpInst::ICMP_ULE, Undef, Undef);
> > +    Plan->addDetachedValue(ICmp);
> > +    BlockMask = Builder.createNaryOp(Instruction::ICmp, {IV, BTC}, ICmp);
> >      return BlockMaskCache[BB] = BlockMask;
> > ```
> > 
> > and then have `VPInstruction::generateInstruction()` do
> > 
> > ```
> > +  case Instruction::ICmp: {
> > +    Value *IV = State.get(getOperand(0), Part);
> > +    Value *TC = State.get(getOperand(1), Part);
> > +    auto *ICmp = cast<ICmpInst>(getUnderlyingValue());
> > +    Value *V = Builder.CreateICmp(ICmp->getPredicate(), IV, TC);
> > +    State.set(this, V, Part);
> > +    break;
> > +  }
> > ```
> > 
> > where `VPlan::addDetachedValue()` is used for disposal purposes only. This has a minor (acceptable?) impact on the underlying IR: it creates/adds-users to `UndefValue`'s.
> Pros/cons are easier to discuss with the code in hand. Diego, would you be able to upload the subclassing in Phabricator?
> 
> The alternative by Ayal/Gil works only because the VPlan modeling is done very late in the vectorization process. That'll make it very hard to move the modeling towards the beginning of vectorization. Please don't do that.
> 
> My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing? 
Yes, I also feel that opening this door could be problematic in the long term. Let me see if I can quickly post the subclass in Phabricator so that we can see which changes are necessary in other places.

> My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?

The closer the class hierarchies are, the easier will be.

Repository:
  rL LLVM

https://reviews.llvm.org/D50480