[PATCH] D50480: [LV] Vectorizing loops of arbitrary trip count without remainder under opt for size

Mon Aug 20 16:34:37 PDT 2018

hsaito added a comment.

For me, the only major issue left is the detached IR instruction. @dcaballe, please try adding the reviewers/subscribers of https://reviews.llvm.org/D50480 to https://reviews.llvm.org/D50823, in the hopes of getting a quicker resolution there, so as not to block https://reviews.llvm.org/D50480 because of that. I will not oppose to https://reviews.llvm.org/D50480 for introducing new ULE opcode of VPInstruction (design/implementation choice within VPlan concept), but I will strongly oppose for the use of detahced IR instruction (goes against VPlan concept).

It's certainly nicer if @Ayal, @dcaballe, and others can agree on VPCmpInst or not quickly enough. I vote in favor of VPCmpInst.

Thanks,
Hideki

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4948

-  // If we don't know the precise trip count, don't try to vectorize.
-  if (TC < 2) {
-    ORE->emit(
-        createMissedAnalysis("UnknownLoopCountComplexCFG")
-        << "unable to calculate the loop count due to complex control flow");
-    LLVM_DEBUG(
-        dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
+  if (TC == 1) {
+    LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");
----------------
Ayal wrote:
> hsaito wrote:
> > reames wrote:
> > > There's a mix of seemingly unrelated changes here.  This is one example.  It would be good to land these separately.  
> > This change is relevant in the sense that TC < 2 is split into two parts: TC==1 and TC==0. TC==0 case will then have a chance of hitting Legal->canFoldTailByMasking() later. As a result, TC==1 case can return early here, with a very crisp messaging. 
> > 
> > Having said that, if you'd like to see the same ORE->emit(...) LLVM_DEBUG() stuff here, I won't go against that. Messaging change can be a separate commit.
> > 
> > Ayal, we need ORE->emit() here, in addition to LLVM_DEBUG(), right, regardless of whether we change the actual message?
> Yes, this change is unrelated and should land separately. The original ORE message is wrong. Not sure the TC==1 qualifies for any ORE message - "loops" with a known trip count of one are simply irrelevant for vectorization; though we could vectorize them with a mask...
I think every non-vectorized loop that goes through vectorizer's analysis qualifies for ORE. After all, TC==1 knowledge may or may not be available to the programmer otherwise. 

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4977
+    return MaxVF;
+
+  return None;
----------------
Ayal wrote:
> hsaito wrote:
> > I think we need to add
> > 
> >      if (TC==0) { emit one kind of remark }
> >      else { emit another kind of remark }
> > 
> > here ---- in order to match previous capability.
> > 
> > 
> OK, will retain the previous MissedAnalysis remarks here, in addition to the new ones supplied by `canFoldTailByMasking()`.
Thank you.

================
Comment at: lib/Transforms/Vectorize/VPlan.h:609
   /// VPlan opcodes, extending LLVM IR with idiomatics instructions.
-  enum { Not = Instruction::OtherOpsEnd + 1 };
+  enum { Not = Instruction::OtherOpsEnd + 1, ICmpULE };

----------------
dcaballe wrote:
> Ayal wrote:
> > dcaballe wrote:
> > > hsaito wrote:
> > > > Ayal wrote:
> > > > > dcaballe wrote:
> > > > > > hsaito wrote:
> > > > > > > Ayal wrote:
> > > > > > > > dcaballe wrote:
> > > > > > > > > I'm worried that this new opcode could be problematic since now we can have compare instructions represented as VPInstructions with Instruction::ICmp and Instruction::FCmp opcodes and VPInstructions with VPInstruction::ICmpULE. Internally, we have a VPCmpInst subclass to model I/FCmp opcodes and their predicates. Do you think it would be better to upstream that subclass first? 
> > > > > > > > An alternative of leveraging `Instruction::ICmp` opcode and existing `ICmpInst` subclasses for keeping the Predicate, in a scalable way, could be (devised jointly w/ Gil):
> > > > > > > > 
> > > > > > > > ```
> > > > > > > > +    // Introduce the early-exit compare IV <= BTC to form header block mask.
> > > > > > > > +    // This is used instead of IV < TC because TC may wrap, unlike BTC.
> > > > > > > > +    VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
> > > > > > > > +    VPValue *BTC = Plan->getBackedgeTakenCount();
> > > > > > > > +    Value *Undef = UndefValue::get(Legal->getPrimaryInduction()->getType());
> > > > > > > > +    auto *ICmp = new ICmpInst(ICmpInst::ICMP_ULE, Undef, Undef);
> > > > > > > > +    Plan->addDetachedValue(ICmp);
> > > > > > > > +    BlockMask = Builder.createNaryOp(Instruction::ICmp, {IV, BTC}, ICmp);
> > > > > > > >      return BlockMaskCache[BB] = BlockMask;
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > and then have `VPInstruction::generateInstruction()` do
> > > > > > > > 
> > > > > > > > ```
> > > > > > > > +  case Instruction::ICmp: {
> > > > > > > > +    Value *IV = State.get(getOperand(0), Part);
> > > > > > > > +    Value *TC = State.get(getOperand(1), Part);
> > > > > > > > +    auto *ICmp = cast<ICmpInst>(getUnderlyingValue());
> > > > > > > > +    Value *V = Builder.CreateICmp(ICmp->getPredicate(), IV, TC);
> > > > > > > > +    State.set(this, V, Part);
> > > > > > > > +    break;
> > > > > > > > +  }
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > where `VPlan::addDetachedValue()` is used for disposal purposes only. This has a minor (acceptable?) impact on the underlying IR: it creates/adds-users to `UndefValue`'s.
> > > > > > > Pros/cons are easier to discuss with the code in hand. Diego, would you be able to upload the subclassing in Phabricator?
> > > > > > > 
> > > > > > > The alternative by Ayal/Gil works only because the VPlan modeling is done very late in the vectorization process. That'll make it very hard to move the modeling towards the beginning of vectorization. Please don't do that.
> > > > > > > 
> > > > > > > My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing? 
> > > > > > Yes, I also feel that opening this door could be problematic in the long term. Let me see if I can quickly post the subclass in Phabricator so that we can see which changes are necessary in other places.
> > > > > > 
> > > > > > > My preference is to be able to templatize VPInstruction and Instruction as much as feasible. Is that easier with subclassing?
> > > > > > 
> > > > > > The closer the class hierarchies are, the easier will be.
> > > > > Extensions of VPInstructions such as VPCmpInst should indeed be uploaded for review and deserve a separate discussion thread and justification. This patch could tentatively make use of it, though for the purpose of this patch an ICmpULE opcode or a detached ICmpInst suffice. An ICmpULE opcode shouldn't be problematic **currently**, as this early-exit is the only VPInstruction compare with a Predicate, right? Note that detached UnderlyingValues could serve as **data containers** for all fields already implemented in the IR hierarchy, and could be constructed at any point of VPlan construction for that purpose. Extending VPInstructions to provide a similar **API** as that of IR Instructions seems to be an orthogonal concern with its own design objectives, and can coexist with detached Values; e.g., a VPCmpInst could hold its Predicate using a detached ICmpInst/FCmpInst.
> > > > I go against detached ICmpInst. We'll be moving VPlan modeling before the cost model and creating an IR Instruction before deciding to vectorize is against the VPlan concept.
> > > > 
> > > > >seems to be an orthogonal concern with its own design objectives
> > > > 
> > > > Not quite. We'd like VPInstruction as easy to use to many LLVM developers and that is an integral part of design/implementation from the beginning.
> > > > 
> > > > Having said that, new opcode versus VPCmpInst doesn't block the rest of the review. Other parts of the review should proceed while opcode versus VPCmpInst discussion is in progress on the side.
> > > I created D50823 with the VPCmpInst sub-class so that we can make a decision with the code in place.
> > VPlans should indeed keep the existing IR intact w/o changing it, as they are tentative by design, and also by current implementation. But creating a detached IR Instruction, just for the purpose of holding its attributes, w/o connecting it to any User, Operand (except Undef's) or BasicBlock, is arguably keeping the existing IR intact.  Doing so should be quite familiar to LLVM developers, avoids mirroring Instruction's class hierarchy or a subset thereof, and leverages the existing UnderlyingValue pointer that is unutilized by InnerLoopVectorizer. Next uploaded version provides this complete option.
> > 
> > Having said that, this patch can surely work with a VP(I)CmpInst just as well, as it merely needs a way for a single compare VPInstruction to hold a single Predicate, and print its name.
> I understand your point, Ayal. However, using UnderlyingValue as a pointer to the actual input IR in the VPlan native path and as a pointer to a detached IR Value in the inner loop path is very likely to be problematic, even in the short term. We would have to special case the code that is shared for both paths to treat the UnderlyingValue differently. The detached IR special semantics in the inner loop path would also make a bit more complicated the convergence of both paths. If there are no major concerns regarding the VPCmpInst, I'd prefer going with that approach.
> 
Detached IR instruction is detrimental to VPlan direction. Please do not use it.

================
Comment at: test/Transforms/LoopVectorize/X86/optsize.ll:4
+; attributes.
+; RUN: opt < %s -loop-vectorize -S -mtriple=x86_64-apple-darwin -mcpu=skx | FileCheck %s
+
----------------
Is the test really dependent on the apple triple?

https://reviews.llvm.org/D50480