[PATCH] D50480: [LV] Vectorizing loops of arbitrary trip count without remainder under opt for size

Tue Aug 14 00:25:51 PDT 2018

Ayal added a comment.

In https://reviews.llvm.org/D50480#1196699, @dcaballe wrote:

> Do you see any potential issue that could make modeling this in the VPlan native path complicated once we have predication?

You should know better

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4965
+    LLVM_DEBUG(dbgs() << "LV: Aborting - trip count below given threshold for "
+                      << "loop with scalar iterations.\n");
     return None;
----------------
dcaballe wrote:
> I'm trying to understand the purpose of thsi check. Prevent masked vectorization if TC is lower than `TinyTripCountInterleaveThreshold` (i.e., 128)?. Should we use an independent threshold for this?
Ah, this is wrong, good catch!
The original purpose (of `TinyTripCountVectorThreshold` rather than `TinyTripCountInterleaveThreshold`) was to prevent vectorization of loops with very short trip counts due to overheads. Later it was extended in r306803 to allow vectorization under OptForSize, as it implies that all iterations are concentrated inside the vector loop for more accurate cost estimation. This still holds when folding the tail by masking, so we should not bail out here.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5218
+  // variable of the vector loop wraps to zero, when tail is folded by masking;
+  // this currently happens when OptForSize, inwhich case IC is set to 1 above.
   unsigned IC = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /
----------------
dcaballe wrote:
> inwhich -> in which?
ok

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:6355
+  if (Legal->blockNeedsPredication(OrigLoop->getHeader()))
+    CM.InterleaveInfo.reset();
+
----------------
dcaballe wrote:
> Just curious. Could we prevent the computation or interleave groups for these cases instead of doing a reset?
That would have been simpler indeed. But there's a subtle phase-ordering issue here: `MaxVF=computeFeasibleMaxVF()` uses tentative interleave groups to `getSmallestAndWidestTypes()`, and is then used in determining if the tail should be folded by masking (i.e., if TC is a multiple of MaxVF), in which case these groups will all be masked/invalid.

================
Comment at: lib/Transforms/Vectorize/VPlan.h:609
   /// VPlan opcodes, extending LLVM IR with idiomatics instructions.
-  enum { Not = Instruction::OtherOpsEnd + 1 };
+  enum { Not = Instruction::OtherOpsEnd + 1, ICmpULE };

----------------
dcaballe wrote:
> I'm worried that this new opcode could be problematic since now we can have compare instructions represented as VPInstructions with Instruction::ICmp and Instruction::FCmp opcodes and VPInstructions with VPInstruction::ICmpULE. Internally, we have a VPCmpInst subclass to model I/FCmp opcodes and their predicates. Do you think it would be better to upstream that subclass first? 
An alternative of leveraging `Instruction::ICmp` opcode and existing `ICmpInst` subclasses for keeping the Predicate, in a scalable way, could be (devised jointly w/ Gil):

```
+    // Introduce the early-exit compare IV <= BTC to form header block mask.
+    // This is used instead of IV < TC because TC may wrap, unlike BTC.
+    VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
+    VPValue *BTC = Plan->getBackedgeTakenCount();
+    Value *Undef = UndefValue::get(Legal->getPrimaryInduction()->getType());
+    auto *ICmp = new ICmpInst(ICmpInst::ICMP_ULE, Undef, Undef);
+    Plan->addDetachedValue(ICmp);
+    BlockMask = Builder.createNaryOp(Instruction::ICmp, {IV, BTC}, ICmp);
     return BlockMaskCache[BB] = BlockMask;
```

and then have `VPInstruction::generateInstruction()` do

```
+  case Instruction::ICmp: {
+    Value *IV = State.get(getOperand(0), Part);
+    Value *TC = State.get(getOperand(1), Part);
+    auto *ICmp = cast<ICmpInst>(getUnderlyingValue());
+    Value *V = Builder.CreateICmp(ICmp->getPredicate(), IV, TC);
+    State.set(this, V, Part);
+    break;
+  }
```

where `VPlan::addDetachedValue()` is used for disposal purposes only. This has a minor (acceptable?) impact on the underlying IR: it creates/adds-users to `UndefValue`'s.

================
Comment at: lib/Transforms/Vectorize/VPlan.h:1126
+  VPlan(VPBlockBase *Entry = nullptr) : Entry(Entry) {
+    BackedgeTakenCount = new VPValue();
+  }
----------------
dcaballe wrote:
> Instead of using an "empty" VPValue to model the BTC, would it be possible to model the actual operations to compute the BTC? We would only need a sub, right?
The BTC is computed by subtracting 1 from the Trip Count, which in turn is generated by SCEVExpander. To model this decrement would require using an "empty" VPValue to model its Trip Count operand. In any case, both involve scalar instructions that take place before the vectorized loop, currently outside the VPlan'd zone.

Repository:
  rL LLVM

https://reviews.llvm.org/D50480