[llvm-commits] [PATCH] LoopVectorizer

Wed Oct 17 08:10:12 PDT 2012

----- Original Message -----
> From: "Ralf Karrenberg" <karrenberg at cdl.uni-saarland.de>
> To: "Nadav Rotem" <nrotem at apple.com>
> Cc: "llvm-commits at cs.uiuc.edu LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Wednesday, October 17, 2012 3:59:11 AM
> Subject: Re: [llvm-commits] [PATCH] LoopVectorizer
> 
> Hi,
> 
> Thanks Nadav, this is a great start for the loop vectorization
> implementation!
> I haven't done a code review for LLVM so far,
> so I'll be happy to get
> some feedback for the feedback as well :).

I think that you've done a good job :)

> 
> On 10/17/12 8:30 AM, Hal Finkel wrote:
> >> +    //Check that all of the users of the loop are inside the BB.
> >> +    for (Value::use_iterator it = I->use_begin(), e =
> >> I->use_end();
> >> +         it != e; ++it) {
> >> +      Instruction *U = cast<Instruction>(*it);
> >> +      BasicBlock *Parent = U->getParent();
> >> +      if (Parent != &BB) {
> >> +        DEBUG(dbgs() << "LV: Found an outside user for : "<< *U
> >> << "\n");
> >> +        return false;
> >> +      }
> >> +    }
> > It seems like handling this case requires just as much code as this
> > check. If there is an outside user, you just need to extract the
> > last vector element and call replaceAllUsesWith(), right?
> I may be mistaken, but I think that would require introducing
> additional
> blocks:
> You cannot put an extractelement operation behind the loop since if
> the
> loop body is not executed, you have a value that is not a vector. You
> also don't want to put it inside the loop because then the extract
> would
> be executed in every iteration although it is only required once.

But putting it inside the loop is exactly what I had in mind. As you point out, this can be improved, but at least we'd be able to handle the loops instead of bailing on them (and the change seems very simple). I suspect that even with the extra extract, you'd still see large, although not optimal, speedups.

 -Hal

> That
> would leave you the option of creating a branch around the loop and
> an
> additional block between the loop exit and the join block (which
> holds a
> phi for each of those values).
> 
> Other comments:
> 
> +  /// Create a broadcast instruction. This method generates a
> broadcast
> +  /// instruction (shuffle) for loop invariant values and for the
> induction
> +  /// value. If this is the induction variable then we extend it to
> N,
> N+2, ...
> +  /// this is needed because each iteration in the loop corresponds
> to
> a SIMD
> +  /// element.
> +  Value *getBroadcastInstrs(Value *V);
> 
> Shouldn't it be <N, N+1, ... N+W-1> where W is the vectorization
> factor?
> 
> +  /// Returns the maximum vectorization factor that we *can* use to
> vectorize
> +  /// this loop. This does not mean that it is profitable to
> vectorize this
> +  /// loop, only that it is legal to do so. This may be a large
> number. We
> +  /// can vectorize to any SIMD width below this number.
> +  unsigned getLoopMaxVF();
> 
> Denote that the number can only be a power of two.
> 
> +  DataLayout *TD;
> 
> Should probably be "DL" now since it is not called "TargetData"
> anymore ;).
> 
> getBroadcastInstrs()
> Do you handle constants with this method as well? Is LLVM's constant
> propagation pass powerful enough to create vector constants from the
> shuffles? Otherwise, I would have some code that creates vector
> constants from scalars directly.
> 
> +Value *SingleBlockLoopVectorizer::getConsecutiveVector(Value* Val) {
> +  assert(Val->getType()->isVectorTy() && "Must be a vector");
> +  assert(Val->getType()->getScalarType()->isIntegerTy() &&
> +         "Elem must be an integer");
> +  Type *ITy = Val->getType()->getScalarType();
> +  VectorType *Ty = cast<VectorType>(Val->getType());
> +  unsigned VF = Ty->getNumElements();
> 
> VF is already a member variable. Remove the last line or convert it
> into
> an assertion.
> Minor remark: I think getConsecutiveVector() is only required to run
> once, so the resulting constant could be created in the constructor
> already unless you don't want more member variables.
> Also, the widening of the induction variable should probably only be
> done once, so you may want to add another member variable for the
> widened one or add it to the WidenMap.
> 
> +  for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op)
> {
> 
> Is there a reason not to use op_begin()/op_end()?
> 
> ScalarizeInstruction()
> You create new extract operations every time an instruction is used
> as
> an operand to a non-vectorizable instruction. Can we rely on
> subsequent
> optimization phases to remove this redundancy again?
> 
> More importantly however, you create extract/insert operations for
> every
> non-vectorizable instruction independently. At some point, we
> probably
> need some heuristic that analyzes chains of instructions: in cases
> like
> alternating vectorizable and non-vectorizable instructions it may be
> beneficial to scalarize instructions that are vectorizable to reduce
> pack/unpack costs [1].
> 
> +  // Get the trip count from the BE count by adding 1.
> +  ExitCount = SE->getAddExpr(ExitCount,
> + SE->getConstant(ExitCount->getType(), 1));
> 
> Bad alignment on the second argument. "BE" is a typo I think, it does
> not appear anywhere else.
> 
> case Instruction::Select: {
> +        // Widen selects.
> +        Value *A = getVectorValue(Inst->getOperand(0));
> +        Value *B = getVectorValue(Inst->getOperand(1));
> +        Value *C = getVectorValue(Inst->getOperand(2));
> +        WidenMap[Inst] = Builder->CreateSelect(A, B, C);
> +        break;
> +      }
> 
> If the condition of a select is uniform we don't want to widen it, so
> I
> would suggest a special case here instead of calling
> getBroadcastInstrs() via getVectorValue() if there is no mapping in
> WidenMap.
> 
> +      case Instruction::Store: {
> +      case Instruction::Load: {
> 
> What about loads/stores with entirely uniform indices? We could
> simply
> insert a single scalar operation there instead of cloning them to do
> W
> times the same thing.
> 
> +  // Check that the underlying objects of the reads are writes are
> either
> 
> Typo, should be "reads and writes".
> 
> 
> Otherwise, the code looks pretty good to me. :)
> I did not take a close look at the tests, though.
> 
> Cheers,
> Ralf
> 
> [1] "Efficient SIMD Code Generation for Irregular Kernels", Kim and
> Han,
> PPoPP'12
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory