RFC: Enable vectorization of call instructions in the loop vectorizer

Arnold Schwaighofer aschwaighofer at apple.com
Mon Mar 17 09:51:18 PDT 2014


Overall this looks great. I have some comments below. Did you run the test suite and make sure that no changes are observed?


@@ -434,7 +438,7 @@
     for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
       if (Tys[i]->isVectorTy()) {
         ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
-        ScalarCalls = std::max(ScalarCalls, RetTy->getVectorNumElements());
+        ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
       }
     }
 
@@ -493,13 +497,40 @@
     unsigned Num = RetTy->getVectorNumElements();
     unsigned Cost = TopTTI->getIntrinsicInstrCost(IID, RetTy->getScalarType(),
                                                   Tys);
-    return 10 * Cost * Num;
+    return Cost * Num;
   }
 
   // This is going to be turned into a library call, make it expensive.
   return 10;
 }


These two changes should go in as a separate patch. They are fixes to the cost model.

If we remove the factor of 10 in the second hunk, we should add a scalarization cost; otherwise we would just be estimating the cost of the scalar calls.
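
Something along these lines, as a sketch only (it mirrors what the first hunk already does for the argument types and would need to be checked against the tree):

    unsigned Num = RetTy->getVectorNumElements();
    // Account for extracting the vector arguments and inserting the scalar
    // results, in addition to the Num scalar calls themselves.
    unsigned ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
    for (unsigned i = 0, ie = Tys.size(); i != ie; ++i)
      if (Tys[i]->isVectorTy())
        ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
    unsigned Cost = TopTTI->getIntrinsicInstrCost(IID, RetTy->getScalarType(),
                                                  Tys);
    return ScalarizationCost + Cost * Num;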



@@ -829,11 +833,6 @@
   /// width. Vector width of one means scalar.
   unsigned getInstructionCost(Instruction *I, unsigned VF);
 
-  /// A helper function for converting Scalar types to vector types.
-  /// If the incoming type is void, we return void. If the VF is 1, we return
-  /// the scalar type.
-  static Type* ToVectorTy(Type *Scalar, unsigned VF);
-
   /// Returns whether the instruction is a load or store and will be a emitted
   /// as a vector operation.
   bool isConsecutiveLoadOrStore(Instruction *I);

@@ -1224,6 +1223,15 @@
   return SE->getSCEV(Ptr);
 }
 
+/// A helper function for converting Scalar types to vector types.
+/// If the incoming type is void, we return void. If the VF is 1, we return
+/// the scalar type.
+static Type* ToVectorTy(Type *Scalar, unsigned VF) {
+  if (Scalar->isVoidTy() || VF == 1)
+    return Scalar;
+  return VectorType::get(Scalar, VF);
+}
+

This is cleanup and should be split into a separate patch.


Thanks for working on this.


On Mar 17, 2014, at 7:38 AM, James Molloy <james at jamesmolloy.co.uk> wrote:

> Hi Arnold,
> 
> Sorry for the large delay in this - I've been working on this in my spare time and haven't had much of that lately! :)
> 
> This version of the patch:
> 
>   * Addresses your three points in your previous email.
>   * Adds support for the Accelerate library, but I only added support for one function in it (expf) for testing purposes. There is a fixme for someone with more Apple knowledge and ability to test than me to fill in the rest.
>   * Updates to ToT and updates TargetLibraryInfo to use C++11 lambdas in std::lower_bound rather than functors.
> 
> Does it look better?
> 
> Cheers,
> 
> James
> 
> 
> On 17 January 2014 17:22, James Molloy <james at jamesmolloy.co.uk> wrote:
> Awesome, thanks Arnold! Very clear now.
> 
> 
> On 17 January 2014 16:45, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> 
> On Jan 17, 2014, at 2:59 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
> 
> > Hi Arnold,
> >
> > > First, we are going to have the situation where there exists an intrinsic ID for a library function (many math library functions have an intrinsic version: expf -> llvm.exp.f32, for example). As a consequence, “getIntrinsicIDForCall” will return it. In this case we can have both: a vectorized library function version and an intrinsic function that may be slower or faster. In such a case the cost model has to decide which one to pick. This means we have to query the cost model about which one is cheaper in two places: when we get the instruction cost and when we vectorize the call.
> >
> > Sure, I will address this.
> >
> > > Second, the way we test this. [snip]
> >
> > This is very sensible. The only reason I didn't go down this route to start with was that I didn't know of an available library (like Accelerate) and didn't want to add testing/dummy code in tree. Thanks for pointing me at Accelerate - that'll give me a real library to (semi) implement and test.
> >
> > > This brings me to issue three. You are currently using TTI->getCallCost() which is not meant to be used with the vectorizer. We should create a getCallInstrCost() function similar to the “getIntrinsicInstrCost” function we already have.
> > >
> > > BasicTTI::getCallInstrCost should query TLI->isFunctionVectorizable() and return a sensible value in this case (one that is lower than a scalarized intrinsic lowered as a lib call).
> >
> > I don't understand the difference between getIntrinsicCost and getIntrinsicInstrCost. They both take the same arguments (but return different values), and the doxygen docstring does not describe the action in enough detail to discern what the required behaviour is.
> >
> > Could you please tell me? (and I'll update the docstrings while I'm at it).
> 
> Sure, TargetTransformInfo is split into two “cost” metrics:
> 
> * Generic target information which returns its cost in terms of “TargetCostConstants”:
> 
>   /// \name Generic Target Information
>   /// @{
> 
>   /// \brief Underlying constants for 'cost' values in this interface.
>   ///
>   /// Many APIs in this interface return a cost. This enum defines the
>   /// fundamental values that should be used to interpret (and produce) those
>   /// costs. The costs are returned as an unsigned rather than a member of this
>   /// enumeration because it is expected that the cost of one IR instruction
>   /// may have a multiplicative factor to it or otherwise won't fit directly
>   /// into the enum. Moreover, it is common to sum or average costs which works
>   /// better as simple integral values. Thus this enum only provides constants.
>   /// @}
> 
> This API is used by the inliner (getUserCost) to estimate the cost (size) of instructions.
> 
> * Throughput estimate for the vectorizer. This API attempts to estimate (very crudely, on an instruction-by-instruction basis) the throughput of instructions (since we automatically infer most values using TargetLoweringInfo, and we have to do this from IR, this is not going to be very accurate …).
> 
>  /// \name Vector Target Information
>  /// @{
>  ...
>  /// \return The expected cost of arithmetic ops, such as mul, xor, fsub, etc.
>  virtual unsigned getArithmeticInstrCost(unsigned Opcode, Type *Ty,
>  ...
>  /// @}
> 
> At a high level, this API tries to answer the question: what does this instruction cost in scalar form (“expf”, f32)? Or what does this instruction cost in vectorized form (“expf”, <4 x float>)?
> 
> BasicTTI::getIntrinsicInstrCost() assumes a cost of 1 for intrinsics that have a corresponding ISA instruction (TLoweringI->isOperationLegalOrPromote(ISD::FEXP) returns true), a cost of 10 for the ones that don’t, and then we also incorporate things like type legalization costs and overhead if we vectorize.
> 
> For the new BasicTTI::getCallInstrCost(Function, RetTy, ArgTys) we would also return 10 for scalar versions of the function (RetTy->isVectorTy() == false).
> For vector queries (RetTy->isVectorTy() == true), if there is a TLibInfo->isVectorizableFunction(Function->getName(), RetTy->getVectorNumElements()) we should also return 10. Otherwise, we estimate the cost of scalarization just like we do in getIntrinsicInstrCost. This will guarantee that the vectorized library function call (Cost = 10) will be chosen over the intrinsic lowered to a sequence of scalarized lib calls (Cost = 10 * VF * …).
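
(For concreteness, the new hook could look roughly like the sketch below. It only illustrates the shape described above; the TLibInfo member and the name of the TargetLibraryInfo query are assumptions taken from this thread, not final code.)

   unsigned BasicTTI::getCallInstrCost(Function *F, Type *RetTy,
                                       ArrayRef<Type *> Tys) const {
     // Scalar form: this becomes a library call, make it expensive.
     if (!RetTy->isVectorTy())
       return 10;

     // If the target library provides a vector version of this function
     // (e.g. expf -> a 4-wide vector expf), the whole vector call costs
     // about as much as one library call.
     unsigned VF = RetTy->getVectorNumElements();
     if (TLibInfo && TLibInfo->isVectorizableFunction(F->getName(), VF))
       return 10;

     // No vector version: the call has to be scalarized. Pay for extracting
     // the vector arguments, inserting the scalar results, and VF scalar
     // library calls.
     unsigned ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
     for (unsigned i = 0, ie = Tys.size(); i != ie; ++i)
       if (Tys[i]->isVectorTy())
         ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
     return ScalarizationCost + 10 * VF;
   }
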
> 
> Then, in LoopVectorizationCostModel::getInstructionCost() you would query both APIs (if getIntrinsicIDForCall returns an ID) and return the smallest:
> 
>   case Instruction::Call: {
>     CallInst *CI = cast<CallInst>(I);
> 
>     Type *RetTy = ToVectorTy(CI->getType(), VF);
>     SmallVector<Type*, 4> Tys;
>     for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
>       Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
>     unsigned LibFuncCallCost =
>         TTI.getCallInstrCost(CI->getCalledFunction(), RetTy, Tys);
> 
>     if (unsigned ID = getIntrinsicIDForCall(CI, TLI))
>       return std::min(LibFuncCallCost,
>                       TTI.getIntrinsicInstrCost(ID, RetTy, Tys));
>     return LibFuncCallCost;
>   }
> 
> 
> Thanks,
> Arnold
> 
>    
> 
> 
> <vectorize-calls.diff>




