[llvm-commits] [PATCH] LoopVectorizer

Wed Oct 17 09:21:49 PDT 2012

On Oct 17, 2012, at 1:59 AM, Ralf Karrenberg <karrenberg at cdl.uni-saarland.de> wrote:

> Hi,
> 
> Thanks Nadav, this is a great start for the loop vectorization implementation!
> I haven't done a code review for LLVM so far, so I'll be happy to get some feedback for the feedback as well :).
> 
> On 10/17/12 8:30 AM, Hal Finkel wrote:
>>> +    //Check that all of the users of the loop are inside the BB.
>>> +    for (Value::use_iterator it = I->use_begin(), e = I->use_end();
>>> +         it != e; ++it) {
>>> +      Instruction *U = cast<Instruction>(*it);
>>> +      BasicBlock *Parent = U->getParent();
>>> +      if (Parent != &BB) {
>>> +        DEBUG(dbgs() << "LV: Found an outside user for : "<< *U << "\n");
>>> +        return false;
>>> +      }
>>> +    }
>> It seems like handling this case requires just as much code as this check. If there is an outside user, you just need to extract the last vector element and call replaceAllUsesWith(), right?
> I may be mistaken, but I think that would require introducing additional blocks:
> You cannot put an extractelement operation behind the loop since if the loop body is not executed, you have a value that is not a vector. You also don't want to put it inside the loop because then the extract would be executed in every iteration although it is only required once. That would leave you the option of creating a branch around the loop and an additional block between the loop exit and the join block (which holds a phi for each of those values).
> 

I think that this is incorrect.  The in-loop value is a vector while the outside user is a scalar.  Think of variable "sum" which adds values from the loop. We can't just pick the last element.  In the case of reduction variables (which we need to address at some point) we can just redo the ops on all of the vector elements. 

> Other comments:
> 
> +  /// Create a broadcast instruction. This method generates a broadcast
> +  /// instruction (shuffle) for loop invariant values and for the induction
> +  /// value. If this is the induction variable then we extend it to N, N+2, ...
> +  /// this is needed because each iteration in the loop corresponds to a SIMD
> +  /// element.
> +  Value *getBroadcastInstrs(Value *V);
> 
> Shouldn't it be <N, N+1, ... N+W-1> where W is the vectorization factor?

Yes. Fixed the comment.

> +  /// Returns the maximum vectorization factor that we *can* use to vectorize
> +  /// this loop. This does not mean that it is profitable to vectorize this
> +  /// loop, only that it is legal to do so. This may be a large number. We
> +  /// can vectorize to any SIMD width below this number.
> +  unsigned getLoopMaxVF();
> 
> Denote that the number can only be a power of two.
> 
> +  DataLayout *TD;

Renamed. 

> 
> Should probably be "DL" now since it is not called "TargetData" anymore ;).
> 
> getBroadcastInstrs()
> Do you handle constants with this method as well? Is LLVM's constant propagation pass powerful enough to create vector constants from the shuffles? Otherwise, I would have some code that creates vector constants from scalars directly.

This handles constants.  In any case, this is hoisted outside of the loop.

> 
> +Value *SingleBlockLoopVectorizer::getConsecutiveVector(Value* Val) {
> +  assert(Val->getType()->isVectorTy() && "Must be a vector");
> +  assert(Val->getType()->getScalarType()->isIntegerTy() &&
> +         "Elem must be an integer");
> +  Type *ITy = Val->getType()->getScalarType();
> +  VectorType *Ty = cast<VectorType>(Val->getType());
> +  unsigned VF = Ty->getNumElements();
> 
> VF is already a member variable. Remove the last line or convert it into an assertion.

Done. 

> Minor remark: I think getConsecutiveVector() is only required to run once, so the resulting constant could be created in the constructor already unless you don't want more member variables.
> Also, the widening of the induction variable should probably only be done once, so you may want to add another member variable for the widened one or add it to the WidenMap.
> 

I agree. I will fix this later. 

> +  for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) {
> 
> Is there a reason not to use op_begin()/op_end()?

I think that it is easier to access the last index this way. 

> 
> ScalarizeInstruction()
> You create new extract operations every time an instruction is used as an operand to a non-vectorizable instruction. Can we rely on subsequent optimization phases to remove this redundancy again?
> 

This is okay. Because of two optimizations:

1. dead code removes unused instructions.  

2. extract from a vector which is build from a sequence of insert-elements is optimized to using the original elements.  

So, even if we have a sequence of scalar instructions, we won't see the accordion effect where we open-and-close vectors. 

> More importantly however, you create extract/insert operations for every non-vectorizable instruction independently. At some point, we probably need some heuristic that analyzes chains of instructions: in cases like alternating vectorizable and non-vectorizable instructions it may be beneficial to scalarize instructions that are vectorizable to reduce pack/unpack costs [1].

We need a cost model, but this is independent of the optimizations i mentioned above. 

> 
> +  // Get the trip count from the BE count by adding 1.
> +  ExitCount = SE->getAddExpr(ExitCount,
> + SE->getConstant(ExitCount->getType(), 1));
> 
> Bad alignment on the second argument. "BE" is a typo I think, it does not appear anywhere else.
> 
> case Instruction::Select: {
> +        // Widen selects.
> +        Value *A = getVectorValue(Inst->getOperand(0));
> +        Value *B = getVectorValue(Inst->getOperand(1));
> +        Value *C = getVectorValue(Inst->getOperand(2));
> +        WidenMap[Inst] = Builder->CreateSelect(A, B, C);
> +        break;
> +      }

> 
> If the condition of a select is uniform we don't want to widen it, so I would suggest a special case here instead of calling getBroadcastInstrs() via getVectorValue() if there is no mapping in WidenMap.
> 

Yes. We need to check if the selector is uniform.   This is a good optimization. 

> +      case Instruction::Store: {
> +      case Instruction::Load: {
> 
> What about loads/stores with entirely uniform indices?

Also a good optimization. Will do. 

> We could simply insert a single scalar operation there instead of cloning them to do W times the same thing.
> 
> +  // Check that the underlying objects of the reads are writes are either
> 
> Typo, should be "reads and writes".

Fixed.

> 
> 
> Otherwise, the code looks pretty good to me. :)
> I did not take a close look at the tests, though.
> 

Thanks for the review ralf!

> Cheers,
> Ralf
> 
> [1] "Efficient SIMD Code Generation for Irregular Kernels", Kim and Han, PPoPP'12