[llvm-commits] [PATCH] LoopVectorizer

Wed Oct 17 18:24:01 PDT 2012

----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: "Ralf Karrenberg" <karrenberg at cdl.uni-saarland.de>
> Cc: "llvm-commits at cs.uiuc.edu LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Wednesday, October 17, 2012 11:21:49 AM
> Subject: Re: [llvm-commits] [PATCH] LoopVectorizer
> 
> 
> On Oct 17, 2012, at 1:59 AM, Ralf Karrenberg
> <karrenberg at cdl.uni-saarland.de> wrote:
> 
> > Hi,
> > 
> > Thanks Nadav, this is a great start for the loop vectorization
> > implementation!
> > I haven't done a code review for LLVM so far, so I'll be happy to
> > get some feedback for the feedback as well :).
> > 
> > On 10/17/12 8:30 AM, Hal Finkel wrote:
> >>> +    //Check that all of the users of the loop are inside the BB.
> >>> +    for (Value::use_iterator it = I->use_begin(), e =
> >>> I->use_end();
> >>> +         it != e; ++it) {
> >>> +      Instruction *U = cast<Instruction>(*it);
> >>> +      BasicBlock *Parent = U->getParent();
> >>> +      if (Parent != &BB) {
> >>> +        DEBUG(dbgs() << "LV: Found an outside user for : "<< *U
> >>> << "\n");
> >>> +        return false;
> >>> +      }
> >>> +    }
> >> It seems like handling this case requires just as much code as
> >> this check. If there is an outside user, you just need to extract
> >> the last vector element and call replaceAllUsesWith(), right?
> > I may be mistaken, but I think that would require introducing
> > additional blocks:
> > You cannot put an extractelement operation behind the loop since if
> > the loop body is not executed, you have a value that is not a
> > vector. You also don't want to put it inside the loop because then
> > the extract would be executed in every iteration although it is
> > only required once. That would leave you the option of creating a
> > branch around the loop and an additional block between the loop
> > exit and the join block (which holds a phi for each of those
> > values).
> > 
> 
> I think that this is incorrect.  The in-loop value is a vector while
> the outside user is a scalar.  Think of variable "sum" which adds
> values from the loop. We can't just pick the last element.  In the
> case of reduction variables (which we need to address at some point)
> we can just redo the ops on all of the vector elements.

The current pass handles only fixed trip-count (single-exit) loops without inter-iteration dependencies, correct? For this class of loops, I believe that what I proposed is correct. Maybe I'm thinking about this incorrectly, can you please provide a counterexample?

Thanks again,
Hal

> 
> > Other comments:
> > 
> > +  /// Create a broadcast instruction. This method generates a
> > broadcast
> > +  /// instruction (shuffle) for loop invariant values and for the
> > induction
> > +  /// value. If this is the induction variable then we extend it
> > to N, N+2, ...
> > +  /// this is needed because each iteration in the loop
> > corresponds to a SIMD
> > +  /// element.
> > +  Value *getBroadcastInstrs(Value *V);
> > 
> > Shouldn't it be <N, N+1, ... N+W-1> where W is the vectorization
> > factor?
> 
> Yes. Fixed the comment.
> 
> > +  /// Returns the maximum vectorization factor that we *can* use
> > to vectorize
> > +  /// this loop. This does not mean that it is profitable to
> > vectorize this
> > +  /// loop, only that it is legal to do so. This may be a large
> > number. We
> > +  /// can vectorize to any SIMD width below this number.
> > +  unsigned getLoopMaxVF();
> > 
> > Denote that the number can only be a power of two.
> > 
> > +  DataLayout *TD;
> 
> Renamed.
> 
> > 
> > Should probably be "DL" now since it is not called "TargetData"
> > anymore ;).
> > 
> > getBroadcastInstrs()
> > Do you handle constants with this method as well? Is LLVM's
> > constant propagation pass powerful enough to create vector
> > constants from the shuffles? Otherwise, I would have some code
> > that creates vector constants from scalars directly.
> 
> This handles constants.  In any case, this is hoisted outside of the
> loop.
> 
> > 
> > +Value *SingleBlockLoopVectorizer::getConsecutiveVector(Value* Val)
> > {
> > +  assert(Val->getType()->isVectorTy() && "Must be a vector");
> > +  assert(Val->getType()->getScalarType()->isIntegerTy() &&
> > +         "Elem must be an integer");
> > +  Type *ITy = Val->getType()->getScalarType();
> > +  VectorType *Ty = cast<VectorType>(Val->getType());
> > +  unsigned VF = Ty->getNumElements();
> > 
> > VF is already a member variable. Remove the last line or convert it
> > into an assertion.
> 
> Done.
> 
> > Minor remark: I think getConsecutiveVector() is only required to
> > run once, so the resulting constant could be created in the
> > constructor already unless you don't want more member variables.
> > Also, the widening of the induction variable should probably only
> > be done once, so you may want to add another member variable for
> > the widened one or add it to the WidenMap.
> > 
> 
> I agree. I will fix this later.
> 
> > +  for (unsigned op = 0, e = Instr->getNumOperands(); op != e;
> > ++op) {
> > 
> > Is there a reason not to use op_begin()/op_end()?
> 
> I think that it is easier to access the last index this way.
> 
> > 
> > ScalarizeInstruction()
> > You create new extract operations every time an instruction is used
> > as an operand to a non-vectorizable instruction. Can we rely on
> > subsequent optimization phases to remove this redundancy again?
> > 
> 
> This is okay. Because of two optimizations:
> 
> 1. dead code removes unused instructions.
> 
> 2. extract from a vector which is build from a sequence of
> insert-elements is optimized to using the original elements.
> 
> So, even if we have a sequence of scalar instructions, we won't see
> the accordion effect where we open-and-close vectors.
> 
> 
> > More importantly however, you create extract/insert operations for
> > every non-vectorizable instruction independently. At some point,
> > we probably need some heuristic that analyzes chains of
> > instructions: in cases like alternating vectorizable and
> > non-vectorizable instructions it may be beneficial to scalarize
> > instructions that are vectorizable to reduce pack/unpack costs
> > [1].
> 
> We need a cost model, but this is independent of the optimizations i
> mentioned above.
> 
> > 
> > +  // Get the trip count from the BE count by adding 1.
> > +  ExitCount = SE->getAddExpr(ExitCount,
> > + SE->getConstant(ExitCount->getType(), 1));
> > 
> > Bad alignment on the second argument. "BE" is a typo I think, it
> > does not appear anywhere else.
> > 
> > case Instruction::Select: {
> > +        // Widen selects.
> > +        Value *A = getVectorValue(Inst->getOperand(0));
> > +        Value *B = getVectorValue(Inst->getOperand(1));
> > +        Value *C = getVectorValue(Inst->getOperand(2));
> > +        WidenMap[Inst] = Builder->CreateSelect(A, B, C);
> > +        break;
> > +      }
> 
> 
> 
> > 
> > If the condition of a select is uniform we don't want to widen it,
> > so I would suggest a special case here instead of calling
> > getBroadcastInstrs() via getVectorValue() if there is no mapping
> > in WidenMap.
> > 
> 
> Yes. We need to check if the selector is uniform.   This is a good
> optimization.
> 
> 
> > +      case Instruction::Store: {
> > +      case Instruction::Load: {
> > 
> > What about loads/stores with entirely uniform indices?
> 
> Also a good optimization. Will do.
> 
> > We could simply insert a single scalar operation there instead of
> > cloning them to do W times the same thing.
> > 
> > +  // Check that the underlying objects of the reads are writes are
> > either
> > 
> > Typo, should be "reads and writes".
> 
> Fixed.
> 
> > 
> > 
> > Otherwise, the code looks pretty good to me. :)
> > I did not take a close look at the tests, though.
> > 
> 
> Thanks for the review ralf!
> 
> 
> > Cheers,
> > Ralf
> > 
> > [1] "Efficient SIMD Code Generation for Irregular Kernels", Kim and
> > Han, PPoPP'12
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory