[PATCH] D50480: [LV] Vectorizing loops of arbitrary trip count without remainder under opt for size

Ayal Zaks via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 15 12:54:35 PDT 2018


Ayal added a comment.

In https://reviews.llvm.org/D50480#1199900, @reames wrote:

> I have a general question about direction, not specific to this patch.
>
> It seems like we're adding a specific form of predication to the vectorizer in this patch and I know we already have support for various predicated load and store idioms. What are our plans in terms of supporting more general predication? For instance, I don't believe we handle loops like the following at the moment:
>
>   for (int i = 0; i < N; i++) {
>    if (unlikely(i > M)) 
>       break;
>    sum += a[i];
>   }
>   
>
> Can the infrastructure in this patch be generalized to handle such cases?  And if so, are there any specific plans to do so?


Good question! Replacing the `break` with a `continue` vectorizes just fine and produces the same result, albeit spinning uselessly for the last N-M iterations. Dealing with such "breaks" directly deserves more thought :-). In general it's probably better to fold the two upper bounds into a single bound, min(N, M+1), producing a countable, unpredicated loop. This is a known optimization for OpenCL 1.x kernels compiled for CPU, which are often guarded with "if (get_global_id(0) > M) continue;" due to work_group size constraints.
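
Just to illustrate (this is not something the patch does), folding the two bounds would yield a countable, unpredicated loop along these lines, assuming M+1 does not overflow:

  int UB = N < M + 1 ? N : M + 1;  // min(N, M+1) iterations execute
  for (int i = 0; i < UB; i++)
    sum += a[i];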

> Secondly, are there any plans to enable this approach for anything other than optsize?

We could, for example, consider enabling it under -O2 for loops whose entire (or nearly entire) body is already conditional; e.g.,

  for (int i = 0; i < N; i++) {
    if (i*i % 4 != 2) {
      <loop body>
    }
  }

Otherwise, the overhead of predicating code that could run unpredicated may be detrimental.



================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2673
   // unroll factor (number of SIMD instructions).
-  Constant *Step = ConstantInt::get(TC->getType(), VF * UF);
   Value *R = Builder.CreateURem(TC, Step, "n.mod.vf");
 
----------------
hsaito wrote:
> Ayal wrote:
> > hsaito wrote:
> > > This Urem creation should be skipped if we aren't generating remainder.
> > This Urem is also used to round N up to a multiple of Step, i.e., when we're not generating remainder.
> Ouch. Well, given the assertion that VF*UF is a (constant) power of two, the UREM and other computation should be reasonably optimizable downstream. So it's probably unfair to ask you to fix the trip count computation, and I won't ask. There is a trade-off between generating more optimal output IR and the cost of maintaining the code to do that; keeping the UREM here opts for lower maintenance. Just for the record.
Rounding N down to a multiple of Step is in general N-(N%Step). If Step is a constant power of two (which is currently always the case, and must be the case when folding the tail by masking), it gets optimized downstream to N&(-Step). If Step were some other constant, it might get optimized downstream to use multiplication instead of division, depending on target characteristics. In any case, this takes place before the loop and is orthogonal to this patch, which simply reuses the existing logic to also round up.
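
For concreteness, here is a plain-C sketch of the round-down and round-up computations, assuming Step is a constant power of two as asserted (the round-up shown is one way to do it, not necessarily the exact IR the patch emits):

```
// Round N down/up to a multiple of Step, with Step a power of two.
unsigned Rem = N % Step;                            // the URem above; folds to N & (Step - 1)
unsigned RoundedDown = N - Rem;                     // folds to N & -Step
unsigned RoundedUp = Rem ? RoundedDown + Step : N;  // needed when folding the tail
```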


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4948
 
-  // If we don't know the precise trip count, don't try to vectorize.
-  if (TC < 2) {
-    ORE->emit(
-        createMissedAnalysis("UnknownLoopCountComplexCFG")
-        << "unable to calculate the loop count due to complex control flow");
-    LLVM_DEBUG(
-        dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
+  if (TC == 1) {
+    LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");
----------------
hsaito wrote:
> reames wrote:
> > There's a mix of seemingly unrelated changes here.  This is one example.  It would be good to land these separately.  
> This change is relevant in the sense that TC < 2 is split into two parts: TC==1 and TC==0. The TC==0 case will then have a chance of hitting Legal->canFoldTailByMasking() later. As a result, the TC==1 case can return early here, with very crisp messaging.
> 
> Having said that, if you'd like to see the same ORE->emit(...) LLVM_DEBUG() stuff here, I won't go against that. Messaging change can be a separate commit.
> 
> Ayal, we need ORE->emit() here, in addition to LLVM_DEBUG(), right, regardless of whether we change the actual message?
Yes, this change is unrelated and should land separately. The original ORE message is wrong. Not sure the TC==1 case qualifies for any ORE message - "loops" with a known trip count of one are simply irrelevant for vectorization; though we could vectorize them with a mask...
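
If we did want a remark for the TC==1 case, a minimal sketch reusing the existing createMissedAnalysis() pattern could look as follows (the remark name and wording are placeholders, not a proposal to commit as-is):

```
if (TC == 1) {
  ORE->emit(createMissedAnalysis("SingleIterationLoop")
            << "loop with a known trip count of one is not worth vectorizing");
  LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");
  return None;
}
```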


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4965
+    LLVM_DEBUG(dbgs() << "LV: Aborting - trip count below given threshold for "
+                      << "loop with scalar iterations.\n");
     return None;
----------------
Ayal wrote:
> dcaballe wrote:
> > I'm trying to understand the purpose of this check. Prevent masked vectorization if TC is lower than `TinyTripCountInterleaveThreshold` (i.e., 128)? Should we use an independent threshold for this?
> Ah, this is wrong, good catch!
> The original purpose (of `TinyTripCountVectorThreshold` rather than `TinyTripCountInterleaveThreshold`) was to prevent vectorization of loops with very short trip counts due to overheads. Later it was extended in r306803 to allow vectorization under OptForSize, as it implies that all iterations are concentrated inside the vector loop for more accurate cost estimation. This still holds when folding the tail by masking, so we should not bail out here.
This, BTW, is caught by vect.**omp**.force.small-tc.ll, but the -vectorizer-min-trip-count=21 flag it uses is external to OpenMP, afaik.
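
In other words, the intended shape of the check is roughly the following sketch (modulo the exact threshold name and how it interacts with the existing OptForSize exception):

```
// Sketch: keep bailing out on tiny known trip counts, unless all iterations
// can be kept inside the vector loop by folding the tail with masking.
if (TC > 0 && TC < TinyTripCountVectorThreshold &&
    !Legal->canFoldTailByMasking())
  return None;
```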


================
Comment at: lib/Transforms/Vectorize/VPlan.h:609
   /// VPlan opcodes, extending LLVM IR with idiomatics instructions.
-  enum { Not = Instruction::OtherOpsEnd + 1 };
+  enum { Not = Instruction::OtherOpsEnd + 1, ICmpULE };
 
----------------
dcaballe wrote:
> hsaito wrote:
> > Ayal wrote:
> > > dcaballe wrote:
> > > > I'm worried that this new opcode could be problematic since now we can have compare instructions represented as VPInstructions with Instruction::ICmp and Instruction::FCmp opcodes and VPInstructions with VPInstruction::ICmpULE. Internally, we have a VPCmpInst subclass to model I/FCmp opcodes and their predicates. Do you think it would be better to upstream that subclass first? 
> > > An alternative of leveraging `Instruction::ICmp` opcode and existing `ICmpInst` subclasses for keeping the Predicate, in a scalable way, could be (devised jointly w/ Gil):
> > > 
> > > ```
> > > +    // Introduce the early-exit compare IV <= BTC to form header block mask.
> > > +    // This is used instead of IV < TC because TC may wrap, unlike BTC.
> > > +    VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
> > > +    VPValue *BTC = Plan->getBackedgeTakenCount();
> > > +    Value *Undef = UndefValue::get(Legal->getPrimaryInduction()->getType());
> > > +    auto *ICmp = new ICmpInst(ICmpInst::ICMP_ULE, Undef, Undef);
> > > +    Plan->addDetachedValue(ICmp);
> > > +    BlockMask = Builder.createNaryOp(Instruction::ICmp, {IV, BTC}, ICmp);
> > >      return BlockMaskCache[BB] = BlockMask;
> > > ```
> > > 
> > > and then have `VPInstruction::generateInstruction()` do
> > > 
> > > ```
> > > +  case Instruction::ICmp: {
> > > +    Value *IV = State.get(getOperand(0), Part);
> > > +    Value *TC = State.get(getOperand(1), Part);
> > > +    auto *ICmp = cast<ICmpInst>(getUnderlyingValue());
> > > +    Value *V = Builder.CreateICmp(ICmp->getPredicate(), IV, TC);
> > > +    State.set(this, V, Part);
> > > +    break;
> > > +  }
> > > ```
> > > 
> > > where `VPlan::addDetachedValue()` is used for disposal purposes only. This has a minor (acceptable?) impact on the underlying IR: it creates/adds-users to `UndefValue`'s.
> > Pros/cons are easier to discuss with the code in hand. Diego, would you be able to upload the subclassing in Phabricator?
> > 
> > The alternative by Ayal/Gil works only because the VPlan modeling is done very late in the vectorization process. That'll make it very hard to move the modeling towards the beginning of vectorization. Please don't do that.
> > 
> > My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing? 
> Yes, I also feel that opening this door could be problematic in the long term. Let me see if I can quickly post the subclass in Phabricator so that we can see which changes are necessary in other places.
> 
> > My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?
> 
> The closer the class hierarchies are, the easier will be.
Extensions of VPInstructions such as VPCmpInst should indeed be uploaded for review, and deserve a separate discussion thread and justification. This patch could tentatively make use of such an extension, though for its purposes an ICmpULE opcode or a detached ICmpInst suffices. An ICmpULE opcode shouldn't be problematic **currently**, as this early-exit is the only VPInstruction compare with a Predicate, right? Note that detached UnderlyingValues could serve as **data containers** for all fields already implemented in the IR hierarchy, and could be constructed at any point of VPlan construction for that purpose. Extending VPInstructions to provide a similar **API** to that of IR Instructions seems to be an orthogonal concern with its own design objectives, and can coexist with detached Values; e.g., a VPCmpInst could hold its Predicate using a detached ICmpInst/FCmpInst, as sketched below.
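
Purely as a hypothetical sketch of that last point (VPCmpInst is not upstream, and all names below are made up for illustration), a compare VPInstruction could keep its Predicate in a detached ICmpInst that serves only as a data container:

```
// Hypothetical sketch: the detached ICmpInst is never inserted into a basic
// block; it merely carries the Predicate (and any other IR-side fields that
// turn out to be useful).
class VPCmpInst : public VPInstruction {
  ICmpInst *DetachedCmp;

public:
  VPCmpInst(CmpInst::Predicate Pred, VPValue *LHS, VPValue *RHS, Type *Ty)
      : VPInstruction(Instruction::ICmp, {LHS, RHS}),
        DetachedCmp(new ICmpInst(Pred, UndefValue::get(Ty),
                                 UndefValue::get(Ty))) {}

  CmpInst::Predicate getPredicate() const {
    return DetachedCmp->getPredicate();
  }

  ~VPCmpInst() { DetachedCmp->deleteValue(); }
};
```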


================
Comment at: lib/Transforms/Vectorize/VPlan.h:1126
+  VPlan(VPBlockBase *Entry = nullptr) : Entry(Entry) {
+    BackedgeTakenCount = new VPValue();
+  }
----------------
hsaito wrote:
> Ayal wrote:
> > dcaballe wrote:
> > > Instead of using an "empty" VPValue to model the BTC, would it be possible to model the actual operations to compute the BTC? We would only need a sub, right?
> > The BTC is computed by subtracting 1 from the Trip Count, which in turn is generated by SCEVExpander. To model this decrement would require using an "empty" VPValue to model its Trip Count operand. In any case, both involve scalar instructions that take place before the vectorized loop, currently outside the VPlan'd zone.
> I'm not a big fan of allocating memory that goes unused in many situations. We can initialize this to nullptr, and create an instance once we know BTC is needed. That'll lose the convenience of being able to check NumUsers, but creating a needsBackedgeTakenCount() member function shouldn't be that bad. It's just Legal->foldTailByMasking(), until something else needs BTC, right?
> 
OK. The VPValue can be created on demand, turning `getBackedgeTakenCount()` into `getOrCreateBackedgeTakenCount()`. `NumUsers` should still be checked, as this keeps the decision of whether to generate the corresponding IR based on the VPlan itself.
In any case, a VPlan is in general a tentative construct, destined for destruction without being materialized, except for the BestPlan, if at all. So holding one VPValue for the BTC, which is always well defined but possibly not always used, seems insignificant.
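
Something along these lines (sketch only; the shape rather than the exact code of the updated patch):

```
VPValue *VPlan::getOrCreateBackedgeTakenCount() {
  if (!BackedgeTakenCount)
    BackedgeTakenCount = new VPValue();
  return BackedgeTakenCount;
}

// Code generation then keys off the recorded uses rather than a separate flag:
//   if (BackedgeTakenCount && BackedgeTakenCount->getNumUsers())
//     ... materialize BTC = TripCount - 1 before the vector loop ...
```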


================
Comment at: test/Transforms/LoopVectorize/X86/optsize.ll:12
+; CHECK-LABEL: @foo_optsize(
+; CHECK: x i8>
+
----------------
reames wrote:
> Testing-wise, expanding out the generated IR w/update-lit-checks, landing the tests without the changes, and then rebasing on top would make it much easier to follow the transform being described for those of us not already expert in the vectorizer code structures. I get that you're following existing practice, but this might be one of the cases that justifies changing existing practice in the area. :)
Agreed. The original target-independent version of optsize.ll still passes, BTW (i.e., fails to vectorize), but due to cost-model considerations rather than scalar-tail considerations.


Repository:
  rL LLVM

https://reviews.llvm.org/D50480




