RFC: Enable vectorization of call instructions in the loop vectorizer

James Molloy james at jamesmolloy.co.uk
Wed Mar 26 07:15:02 PDT 2014


Hi Arnold,

I've run the patch through LNT and I'm sufficiently satisfied that it
doesn't regress performance. The variability in the test-suite makes it
difficult to say for certain, however (I used a multi-sample of 5, and
there were some regressions in the very short-running tests and some
improvements too).

I made the change you asked for, and attached are three patches: one fixing
the cost model (I added a scalarization cost here too), one moving ToVectorTy
around, and a third with the full change (obviously I'll rebase the third on
the first two once they're committed).

Is it now OK to commit?

Cheers,

James


On 17 March 2014 16:58, James Molloy <James.Molloy at arm.com> wrote:

> Hi Arnold,
>
> I'm running the test suite now. I can create separate patches for the
> hunks you asked for - the final patch will then rely on those.
>
> > If we remove the factor of 10 in the second hunk we should add a
> > scalarization cost, otherwise we would just be estimating the cost of the
> > scalar calls.
>
> The "Cost" variable already has the scalarization cost of extracting the
> parameter to the scalar call from the vector value - the only thing that is
> missing is the cost of inserting all the scalar return values into a
> vector. I'll add that. With the "10 *", it was double counting cost and was
> ending up at 400 for vectorizing llvm.exp.f32 on X86... which is about 10
> times the cost of vectorizing the libcall to expf()! So a test was failing.
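>
> (To spell out the arithmetic, assuming a scalar intrinsic cost of 10 and
> VF = 4: the old "10 * Cost * Num" formula gave 10 * 10 * 4 = 400, whereas
> "Cost * Num" plus the insert/extract overhead gives 40 plus a handful of
> scalarization operations - in line with what the updated tests expect.)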
>
> Cheers,
>
> James
>
> > -----Original Message-----
> > From: Arnold Schwaighofer [mailto:aschwaighofer at apple.com]
> > Sent: 17 March 2014 16:51
> > To: James Molloy
> > Cc: Renato Golin; James Molloy; llvm-commits
> > Subject: Re: RFC: Enable vectorization of call instructions in the loop
> > vectorizer
> >
> > Overall this looks great. I have some comments below. Did you run the test
> > suite and make sure that no changes are observed?
> >
> >
> > @@ -434,7 +438,7 @@
> >      for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
> >        if (Tys[i]->isVectorTy()) {
> >          ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
> > -        ScalarCalls = std::max(ScalarCalls, RetTy->getVectorNumElements());
> > +        ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
> >        }
> >      }
> >
> > @@ -493,13 +497,40 @@
> >      unsigned Num = RetTy->getVectorNumElements();
> >      unsigned Cost = TopTTI->getIntrinsicInstrCost(IID, RetTy->getScalarType(),
> >                                                    Tys);
> > -    return 10 * Cost * Num;
> > +    return Cost * Num;
> >    }
> >
> >    // This is going to be turned into a library call, make it expensive.
> >    return 10;
> >  }
> >
> >
> > These two changes should go in as a separate patch. They are fixes to the
> > cost model.
> >
> > If we remove the factor of 10 in the second hunk we should add a
> > scalarization cost, otherwise we would just be estimating the cost of the
> > scalar calls.
> >
> >
> >
> > @@ -829,11 +833,6 @@
> >    /// width. Vector width of one means scalar.
> >    unsigned getInstructionCost(Instruction *I, unsigned VF);
> >
> > -  /// A helper function for converting Scalar types to vector types.
> > -  /// If the incoming type is void, we return void. If the VF is 1, we return
> > -  /// the scalar type.
> > -  static Type* ToVectorTy(Type *Scalar, unsigned VF);
> > -
> >    /// Returns whether the instruction is a load or store and will be emitted
> >    /// as a vector operation.
> >    bool isConsecutiveLoadOrStore(Instruction *I);
> >
> > @@ -1224,6 +1223,15 @@
> >    return SE->getSCEV(Ptr);
> >  }
> >
> > +/// A helper function for converting Scalar types to vector types.
> > +/// If the incoming type is void, we return void. If the VF is 1, we return
> > +/// the scalar type.
> > +static Type* ToVectorTy(Type *Scalar, unsigned VF) {
> > +  if (Scalar->isVoidTy() || VF == 1)
> > +    return Scalar;
> > +  return VectorType::get(Scalar, VF);
> > +}
> > +
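> >
> > (As a quick illustration of the helper: ToVectorTy(float, 4) yields
> > <4 x float>, while ToVectorTy(float, 1) and ToVectorTy(void, VF) return the
> > input type unchanged.)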
> >
> > This is cleanup and should be split into a separate patch.
> >
> >
> > Thanks for working on this.
> >
> >
> > On Mar 17, 2014, at 7:38 AM, James Molloy <james at jamesmolloy.co.uk>
> > wrote:
> >
> > > Hi Arnold,
> > >
> > > Sorry for the large delay in this - I've been working on this in my spare
> > > time and haven't had much of that lately! :)
> > >
> > > This version of the patch:
> > >
> > >   * Addresses your three points in your previous email.
> > >   * Adds support for the Accelerate library, but I only added support for
> > >     one function in it (expf) for testing purposes. There is a FIXME for
> > >     someone with more Apple knowledge and ability to test than me to fill
> > >     in the rest.
> > >   * Updates to ToT and updates TargetLibraryInfo to use C++11 lambdas in
> > >     std::lower_bound rather than functors.
> > >
> > > Does it look better?
> > >
> > > Cheers,
> > >
> > > James
> > >
> > >
> > > On 17 January 2014 17:22, James Molloy <james at jamesmolloy.co.uk> wrote:
> > > Awesome, thanks Arnold! Very clear now.
> > >
> > >
> > > On 17 January 2014 16:45, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> > >
> > > On Jan 17, 2014, at 2:59 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
> > >
> > > > Hi Arnold,
> > > >
> > > > > First, we are going to have the situation where there exists an
> > > > > intrinsic ID for a library function (many math library functions have
> > > > > an intrinsic version: expf -> llvm.exp.f32 for example). As a
> > > > > consequence "getIntrinsicIDForCall" will return it. In this case we can
> > > > > have both: a vectorized library function version and an intrinsic
> > > > > function that may be slower or faster. In such a case the cost model
> > > > > has to decide which one to pick. This means we have to query the cost
> > > > > model about which one is cheaper in two places: when we get the
> > > > > instruction cost and when we vectorize the call.
> > > >
> > > > Sure, I will address this.
> > > >
> > > > > Second, the way we test this. [snip]
> > > >
> > > > This is very sensible. The only reason I didn't go down this route to
> > > > start with was that I didn't know of an available library (like
> > > > Accelerate) and didn't want to add testing/dummy code in tree. Thanks
> > > > for pointing me at Accelerate - that'll give me a real library to (semi)
> > > > implement and test.
> > > >
> > > > > This brings me to issue three. You are currently using
> > > > > TTI->getCallCost(), which is not meant to be used with the vectorizer.
> > > > > We should create a getCallInstrCost() function similar to the
> > > > > "getIntrinsicInstrCost" function we already have.
> > > > >
> > > > > BasicTTI::getCallInstrCost should query TLI->isFunctionVectorizable()
> > > > > and return a sensible value in this case (one that is lower than a
> > > > > scalarized intrinsic lowered as a lib call).
> > > >
> > > > I don't understand the difference between getIntrinsicCost and
> > > > getIntrinsicInstrCost. They both take the same arguments (but return
> > > > different values), and the doxygen docstring does not describe the
> > > > action in enough detail to discern what the required behaviour is.
> > > >
> > > > Could you please tell me? (and I'll update the docstrings while I'm at
> > > > it).
> > >
> > > Sure, TargetTransformInfo is split into two "cost" metrics:
> > >
> > > * Generic target information which returns its cost in terms of
> > > "TargetCostConstants":
> > >
> > >   /// \name Generic Target Information
> > >   /// @{
> > >
> > >   /// \brief Underlying constants for 'cost' values in this interface.
> > >   ///
> > >   /// Many APIs in this interface return a cost. This enum defines the
> > >   /// fundamental values that should be used to interpret (and produce)
> > >   /// those costs. The costs are returned as an unsigned rather than a
> > >   /// member of this enumeration because it is expected that the cost of
> > >   /// one IR instruction may have a multiplicative factor to it or
> > >   /// otherwise won't fit directly into the enum. Moreover, it is common
> > >   /// to sum or average costs which works better as simple integral
> > >   /// values. Thus this enum only provides constants.
> > >   ...
> > >   /// @}
> > >
> > > This API is used by the inliner (getUserCost) to estimate the cost (size)
> > > of instructions.
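> > >
> > > For reference, the constants elided above are, roughly:
> > >
> > >   enum TargetCostConstants {
> > >     TCC_Free = 0,      // Expected to fold away completely.
> > >     TCC_Basic = 1,     // The cost of a typical 'add' instruction.
> > >     TCC_Expensive = 4  // The cost of a division-like operation.
> > >   };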
> > >
> > > * Throughput estimate for the vectorizer. This API attempts to estimate
> > > (very crudely, on an instruction-by-instruction basis) the throughput of
> > > instructions. Since we automatically infer most values using
> > > TargetLoweringInfo, and we have to do this from IR, this is not going to
> > > be very accurate ...
> > >
> > >  /// \name Vector Target Information
> > >  /// @{
> > >  ...
> > >  /// \return The expected cost of arithmetic ops, such as mul, xor, fsub, etc.
> > >  virtual unsigned getArithmeticInstrCost(unsigned Opcode, Type *Ty,
> > >  ...
> > >  /// @}
> > >
> > > At a high level, this API tries to answer the question: what does this
> > > instruction cost in a scalar form ("expf", f32)? Or, what does this
> > > instruction cost in a vectorized form ("expf", <4 x float>)?
> > >
> > > BasicTTI::getIntrinsicInstrCost() assumes a cost of 1 for intrinsics that
> > > have a corresponding ISA instruction
> > > (TargetLoweringInfo->isOperationLegalOrPromote(ISD::FEXP) returns true), a
> > > cost of 10 for the ones that don't, and then we also incorporate things
> > > like type legalization costs, and overhead if we vectorize.
> > >
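> > > Roughly, that amounts to something like the following (a paraphrase of
> > > BasicTTI, not the exact code; ISDOpcode stands for the ISD node matched
> > > from the intrinsic ID):
> > >
> > >   std::pair<unsigned, MVT> LT = getTLI()->getTypeLegalizationCost(RetTy);
> > >   if (getTLI()->isOperationLegalOrPromote(ISDOpcode, LT.second))
> > >     return LT.first;  // Roughly one ISA instruction per legalized part.
> > >   // Otherwise this becomes a library call, so make it expensive.
> > >   return 10;
> > >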
> > > For the new BasicTTI::getCallInstrCost(Function, RetTy, ArgTys) we would
> > > also return 10 for scalar versions of the function
> > > (RetTy->isVectorTy() == false).
> > > For vector queries (RetTy->isVectorTy() == true), if there is a
> > > TLibInfo->isVectorizableFunction(Function->getCalledFunction->getName(),
> > > RetTy->getVectorNumElements()), we should also return 10. Otherwise, we
> > > estimate the cost of scalarization just like we do in
> > > getIntrinsicInstrCost. This will guarantee that the vectorized library
> > > function call (Cost = 10) will be chosen over the intrinsic lowered to a
> > > sequence of scalarized lib calls (Cost = 10 * VF * ...).
> > >
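> > > Concretely, at VF = 4 the vectorized library call is costed at 10, while
> > > the scalarized intrinsic comes to at least 10 * 4 plus the insert/extract
> > > overhead, so the cost model will reliably prefer the library call.
> > >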
> > > Then, in LoopVectorizationCostModel::getInstructionCost() you would query
> > > both APIs (if getIntrinsicIDForCall returns an ID) and return the smaller:
> > >
> > >  case Instruction::Call: {
> > >    CallInst *CI = cast<CallInst>(I);
> > >
> > >    Type *RetTy = ToVectorTy(CI->getType(), VF);
> > >    SmallVector<Type*, 4> Tys;
> > >    for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
> > >      Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
> > >    unsigned LibFuncCallCost =
> > >      TTI.getCallInstrCost(CI->getCalledFunction(), RetTy, Tys);
> > >
> > >    if (unsigned ID = getIntrinsicIDForCall(CI, TLI))
> > >      return std::min(LibFuncCallCost,
> > >                      TTI.getIntrinsicInstrCost(ID, RetTy, Tys));
> > >    return LibFuncCallCost;
> > >  }
> > >
> > >
> > > Thanks,
> > > Arnold
> > >
> > >
> > >
> > >
> > > <vectorize-calls.diff>
> >
>
>
-------------- next part --------------
Index: test/Analysis/CostModel/X86/intrinsic-cost.ll
===================================================================
--- test/Analysis/CostModel/X86/intrinsic-cost.ll	(revision 204319)
+++ test/Analysis/CostModel/X86/intrinsic-cost.ll	(working copy)
@@ -22,7 +22,7 @@
   ret void
 
 ; CORE2: Printing analysis 'Cost Model Analysis' for function 'test1':
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction:   %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 46 for instruction:   %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
 
 ; COREI7: Printing analysis 'Cost Model Analysis' for function 'test1':
 ; COREI7: Cost Model: Found an estimated cost of 1 for instruction:   %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
@@ -50,7 +50,7 @@
   ret void
 
 ; CORE2: Printing analysis 'Cost Model Analysis' for function 'test2':
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction:   %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 46 for instruction:   %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
 
 ; COREI7: Printing analysis 'Cost Model Analysis' for function 'test2':
 ; COREI7: Cost Model: Found an estimated cost of 1 for instruction:   %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
Index: lib/CodeGen/BasicTargetTransformInfo.cpp
===================================================================
--- lib/CodeGen/BasicTargetTransformInfo.cpp	(revision 204319)
+++ lib/CodeGen/BasicTargetTransformInfo.cpp	(working copy)
@@ -434,7 +438,7 @@
     for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
       if (Tys[i]->isVectorTy()) {
         ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
-        ScalarCalls = std::max(ScalarCalls, RetTy->getVectorNumElements());
+        ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
       }
     }
 
@@ -493,13 +497,21 @@
     unsigned Num = RetTy->getVectorNumElements();
     unsigned Cost = TopTTI->getIntrinsicInstrCost(IID, RetTy->getScalarType(),
                                                   Tys);
-    return 10 * Cost * Num;
+    unsigned ScalarizationCost = 0;
+    if (RetTy->isVectorTy())
+      ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
+    for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
+      if (Tys[i]->isVectorTy())
+        ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
+    }
+
+    return Cost * Num + ScalarizationCost;
   }
 
   // This is going to be turned into a library call, make it expensive.
   return 10;
 }
 
 unsigned BasicTTI::getNumberOfParts(Type *Tp) const {
   std::pair<unsigned, MVT> LT = getTLI()->getTypeLegalizationCost(Tp);
   return LT.first;
-------------- next part --------------
Index: lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================
--- lib/Transforms/Vectorize/LoopVectorize.cpp	(revision 204319)
+++ lib/Transforms/Vectorize/LoopVectorize.cpp	(working copy)
@@ -829,11 +833,6 @@
   /// width. Vector width of one means scalar.
   unsigned getInstructionCost(Instruction *I, unsigned VF);
 
-  /// A helper function for converting Scalar types to vector types.
-  /// If the incoming type is void, we return void. If the VF is 1, we return
-  /// the scalar type.
-  static Type* ToVectorTy(Type *Scalar, unsigned VF);
-
   /// Returns whether the instruction is a load or store and will be a emitted
   /// as a vector operation.
   bool isConsecutiveLoadOrStore(Instruction *I);
@@ -1217,6 +1216,15 @@
   return SE->getSCEV(Ptr);
 }
 
+/// A helper function for converting Scalar types to vector types.
+/// If the incoming type is void, we return void. If the VF is 1, we return
+/// the scalar type.
+static Type* ToVectorTy(Type *Scalar, unsigned VF) {
+  if (Scalar->isVoidTy() || VF == 1)
+    return Scalar;
+  return VectorType::get(Scalar, VF);
+}
+
 void LoopVectorizationLegality::RuntimePointerCheck::insert(
     ScalarEvolution *SE, Loop *Lp, Value *Ptr, bool WritePtr, unsigned DepSetId,
     ValueToValueMap &Strides) {
@@ -5642,12 +5741,6 @@
   }// end of switch.
 }
 
-Type* LoopVectorizationCostModel::ToVectorTy(Type *Scalar, unsigned VF) {
-  if (Scalar->isVoidTy() || VF == 1)
-    return Scalar;
-  return VectorType::get(Scalar, VF);
-}
-
 char LoopVectorize::ID = 0;
 static const char lv_name[] = "Loop Vectorization";
 INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)
-------------- next part --------------
Index: test/Transforms/LoopVectorize/libcall.ll
===================================================================
--- test/Transforms/LoopVectorize/libcall.ll	(revision 0)
+++ test/Transforms/LoopVectorize/libcall.ll	(working copy)
@@ -0,0 +1,55 @@
+; RUN: opt < %s  -loop-vectorize -S | FileCheck %s
+
+target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
+target triple = "armv7-apple-macos-Accelerate"
+
+;CHECK-LABEL: @exp_intrinsic_f32(
+;CHECK: vexp
+;CHECK: ret void
+define void @exp_intrinsic_f32(i32 %n, float* noalias %y, float* noalias %x) nounwind uwtable {
+entry:
+  %cmp6 = icmp sgt i32 %n, 0
+  br i1 %cmp6, label %for.body, label %for.end
+
+for.body:                                         ; preds = %entry, %for.body
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
+  %arrayidx = getelementptr inbounds float* %y, i64 %indvars.iv
+  %0 = load float* %arrayidx, align 4
+  %call = tail call float @llvm.exp.f32(float %0) nounwind readnone
+  %arrayidx2 = getelementptr inbounds float* %x, i64 %indvars.iv
+  store float %call, float* %arrayidx2, align 4
+  %indvars.iv.next = add i64 %indvars.iv, 1
+  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+  %exitcond = icmp eq i32 %lftr.wideiv, %n
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:                                          ; preds = %for.body, %entry
+  ret void
+}
+
+;CHECK-LABEL: @exp_libcall_f32(
+;CHECK: vexp
+;CHECK: ret void
+define void @exp_libcall_f32(i32 %n, float* noalias %y, float* noalias %x) nounwind uwtable {
+entry:
+  %cmp6 = icmp sgt i32 %n, 0
+  br i1 %cmp6, label %for.body, label %for.end
+
+for.body:                                         ; preds = %entry, %for.body
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
+  %arrayidx = getelementptr inbounds float* %y, i64 %indvars.iv
+  %0 = load float* %arrayidx, align 4
+  %call = tail call float @expf(float %0) nounwind readnone
+  %arrayidx2 = getelementptr inbounds float* %x, i64 %indvars.iv
+  store float %call, float* %arrayidx2, align 4
+  %indvars.iv.next = add i64 %indvars.iv, 1
+  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+  %exitcond = icmp eq i32 %lftr.wideiv, %n
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:                                          ; preds = %for.body, %entry
+  ret void
+}
+
+declare float @llvm.exp.f32(float) nounwind readnone
+declare float @expf(float) nounwind readnone
\ No newline at end of file
Index: test/Transforms/LoopVectorize/funcall.ll
===================================================================
--- test/Transforms/LoopVectorize/funcall.ll	(revision 204319)
+++ test/Transforms/LoopVectorize/funcall.ll	(working copy)
@@ -1,6 +1,7 @@
 ; RUN: opt -S -loop-vectorize -force-vector-width=2 -force-vector-unroll=1 < %s | FileCheck %s
 
 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
+target triple = "armv7-apple-macos-Accelerate"
 
 ; Make sure we can vectorize loops with functions to math library functions.
 ; They might read the rounding mode but we are only vectorizing loops that
Index: test/Analysis/CostModel/X86/intrinsic-cost.ll
===================================================================
--- test/Analysis/CostModel/X86/intrinsic-cost.ll	(revision 204319)
+++ test/Analysis/CostModel/X86/intrinsic-cost.ll	(working copy)
@@ -22,7 +22,7 @@
   ret void
 
 ; CORE2: Printing analysis 'Cost Model Analysis' for function 'test1':
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction:   %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 40 for instruction:   %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
 
 ; COREI7: Printing analysis 'Cost Model Analysis' for function 'test1':
 ; COREI7: Cost Model: Found an estimated cost of 1 for instruction:   %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
@@ -50,7 +50,7 @@
   ret void
 
 ; CORE2: Printing analysis 'Cost Model Analysis' for function 'test2':
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction:   %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 40 for instruction:   %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
 
 ; COREI7: Printing analysis 'Cost Model Analysis' for function 'test2':
 ; COREI7: Cost Model: Found an estimated cost of 1 for instruction:   %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
Index: include/llvm/Analysis/TargetTransformInfo.h
===================================================================
--- include/llvm/Analysis/TargetTransformInfo.h	(revision 204319)
+++ include/llvm/Analysis/TargetTransformInfo.h	(working copy)
@@ -389,6 +389,10 @@
   virtual unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
                                          ArrayRef<Type *> Tys) const;
 
+  /// \returns The cost of Call instructions.
+  virtual unsigned getCallInstrCost(Function *F, Type *RetTy,
+                                    ArrayRef<Type *> Tys) const;
+
   /// \returns The number of pieces into which the provided type must be
   /// split during legalization. Zero is returned when the answer is unknown.
   virtual unsigned getNumberOfParts(Type *Tp) const;
Index: include/llvm/Target/TargetLibraryInfo.h
===================================================================
--- include/llvm/Target/TargetLibraryInfo.h	(revision 204319)
+++ include/llvm/Target/TargetLibraryInfo.h	(working copy)
@@ -11,6 +11,7 @@
 #define LLVM_TARGET_TARGETLIBRARYINFO_H
 
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/ArrayRef.h"
 #include "llvm/Pass.h"
 
 namespace llvm {
@@ -673,10 +674,26 @@
 /// library functions are available for the current target, and allows a
 /// frontend to disable optimizations through -fno-builtin etc.
 class TargetLibraryInfo : public ImmutablePass {
+public:
+  /// VecDesc - Describes a possible vectorization of a function.
+  /// Function 'VectorFnName' is equivalent to 'ScalarFnName' vectorized
+  /// by a factor 'VectorizationFactor'.
+  struct VecDesc {
+    const char *ScalarFnName;
+    const char *VectorFnName;
+    unsigned VectorizationFactor;
+  };
+
+private:
   virtual void anchor();
   unsigned char AvailableArray[(LibFunc::NumLibFuncs+3)/4];
   llvm::DenseMap<unsigned, std::string> CustomNames;
   static const char* StandardNames[LibFunc::NumLibFuncs];
+  /// Vectorization descriptors - sorted by ScalarFnName.
+  std::vector<VecDesc> VectorDescs;
+  /// Scalarization descriptors - same content as VectorDescs but sorted based
+  /// on VectorFnName rather than ScalarFnName.
+  std::vector<VecDesc> ScalarDescs;
 
   enum AvailabilityState {
     StandardName = 3, // (memset to all ones)
@@ -772,6 +789,38 @@
   /// disableAllFunctions - This disables all builtins, which is used for
   /// options like -fno-builtin.
   void disableAllFunctions();
+
+  /// addVectorizableFunctions - Add a set of scalar -> vector mappings,
+  /// queryable via getVectorizedFunction and getScalarizedFunction.
+  void addVectorizableFunctions(ArrayRef<VecDesc> Fns);
+
+  /// isFunctionVectorizable - Return true if the function F has a
+  /// vector equivalent with vectorization factor VF.
+  bool isFunctionVectorizable(StringRef F, unsigned VF) const {
+    return !getVectorizedFunction(F, VF).empty();
+  }
+
+  /// isFunctionVectorizable - Return true if the function F has a
+  /// vector equivalent with any vectorization factor.
+  bool isFunctionVectorizable(StringRef F) const;
+
+  /// getVectorizedFunction - Return the name of the equivalent of 
+  /// F, vectorized with factor VF. If no such mapping exists,
+  /// return the empty string.
+  StringRef getVectorizedFunction(StringRef F, unsigned VF) const;
+
+  /// isFunctionScalarizable - Return true if the function F has a
+  /// scalar equivalent, and set VF to be the vectorization factor.
+  bool isFunctionScalarizable(StringRef F, unsigned &VF) const {
+    return !getScalarizedFunction(F, VF).empty();
+  }
+
+  /// getScalarizedFunction - Return the name of the equivalent of 
+  /// F, scalarized. If no such mapping exists, return the empty string.
+  ///
+  /// Set VF to the vectorization factor.
+  StringRef getScalarizedFunction(StringRef F, unsigned &VF) const;
+
 };
 
 } // end namespace llvm
Index: lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================
--- lib/Transforms/Vectorize/LoopVectorize.cpp	(revision 204319)
+++ lib/Transforms/Vectorize/LoopVectorize.cpp	(working copy)
@@ -220,11 +220,12 @@
 public:
   InnerLoopVectorizer(Loop *OrigLoop, ScalarEvolution *SE, LoopInfo *LI,
                       DominatorTree *DT, const DataLayout *DL,
-                      const TargetLibraryInfo *TLI, unsigned VecWidth,
-                      unsigned UnrollFactor)
-      : OrigLoop(OrigLoop), SE(SE), LI(LI), DT(DT), DL(DL), TLI(TLI),
-        VF(VecWidth), UF(UnrollFactor), Builder(SE->getContext()), Induction(0),
-        OldInduction(0), WidenMap(UnrollFactor), Legal(0) {}
+                      const TargetLibraryInfo *TLI,
+                      const TargetTransformInfo *TTI,
+                      unsigned VecWidth, unsigned UnrollFactor)
+    : OrigLoop(OrigLoop), SE(SE), LI(LI), DT(DT), DL(DL), TLI(TLI), TTI(TTI),
+      VF(VecWidth), UF(UnrollFactor), Builder(SE->getContext()), Induction(0),
+      OldInduction(0), WidenMap(UnrollFactor), Legal(0) {}
 
   // Perform the actual loop widening (vectorization).
   void vectorize(LoopVectorizationLegality *L) {
@@ -382,6 +383,8 @@
   const DataLayout *DL;
   /// Target Library Info.
   const TargetLibraryInfo *TLI;
+  /// Target Transform Info.
+  const TargetTransformInfo *TTI;
 
   /// The vectorization SIMD factor to use. Each vector will have this many
   /// vector elements.
@@ -429,8 +432,9 @@
 public:
   InnerLoopUnroller(Loop *OrigLoop, ScalarEvolution *SE, LoopInfo *LI,
                     DominatorTree *DT, const DataLayout *DL,
-                    const TargetLibraryInfo *TLI, unsigned UnrollFactor) :
-    InnerLoopVectorizer(OrigLoop, SE, LI, DT, DL, TLI, 1, UnrollFactor) { }
+                    const TargetLibraryInfo *TLI, const TargetTransformInfo *TTI,
+                    unsigned UnrollFactor) :
+    InnerLoopVectorizer(OrigLoop, SE, LI, DT, DL, TLI, TTI, 1, UnrollFactor) { }
 
 private:
   void scalarizeInstruction(Instruction *Instr,
@@ -829,11 +833,6 @@
   /// width. Vector width of one means scalar.
   unsigned getInstructionCost(Instruction *I, unsigned VF);
 
-  /// A helper function for converting Scalar types to vector types.
-  /// If the incoming type is void, we return void. If the VF is 1, we return
-  /// the scalar type.
-  static Type* ToVectorTy(Type *Scalar, unsigned VF);
-
   /// Returns whether the instruction is a load or store and will be a emitted
   /// as a vector operation.
   bool isConsecutiveLoadOrStore(Instruction *I);
@@ -1139,11 +1138,11 @@
         return false;
       DEBUG(dbgs() << "LV: Trying to at least unroll the loops.\n");
       // We decided not to vectorize, but we may want to unroll.
-      InnerLoopUnroller Unroller(L, SE, LI, DT, DL, TLI, UF);
+      InnerLoopUnroller Unroller(L, SE, LI, DT, DL, TLI, TTI, UF);
       Unroller.vectorize(&LVL);
     } else {
       // If we decided that it is *legal* to vectorize the loop then do it.
-      InnerLoopVectorizer LB(L, SE, LI, DT, DL, TLI, VF.Width, UF);
+      InnerLoopVectorizer LB(L, SE, LI, DT, DL, TLI, TTI, VF.Width, UF);
       LB.vectorize(&LVL);
     }
 
@@ -1217,6 +1216,15 @@
   return SE->getSCEV(Ptr);
 }
 
+/// A helper function for converting Scalar types to vector types.
+/// If the incoming type is void, we return void. If the VF is 1, we return
+/// the scalar type.
+static Type* ToVectorTy(Type *Scalar, unsigned VF) {
+  if (Scalar->isVoidTy() || VF == 1)
+    return Scalar;
+  return VectorType::get(Scalar, VF);
+}
+
 void LoopVectorizationLegality::RuntimePointerCheck::insert(
     ScalarEvolution *SE, Loop *Lp, Value *Ptr, bool WritePtr, unsigned DepSetId,
     ValueToValueMap &Strides) {
@@ -3108,28 +3116,105 @@
 
       Module *M = BB->getParent()->getParent();
       CallInst *CI = cast<CallInst>(it);
+      StringRef FnName = CI->getCalledFunction()->getName();
+      Function *F = CI->getCalledFunction();
+      Type *RetTy = ToVectorTy(CI->getType(), VF);
+      SmallVector<Type*, 4> Tys;
+      for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
+        Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
+
       Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI);
-      assert(ID && "Not an intrinsic call!");
-      switch (ID) {
-      case Intrinsic::lifetime_end:
-      case Intrinsic::lifetime_start:
-        scalarizeInstruction(it);
-        break;
-      default:
+      if (ID && TTI->getIntrinsicInstrCost(ID, RetTy, Tys) <
+          TTI->getCallInstrCost(F, RetTy, Tys)) {
+        switch (ID) {
+        case Intrinsic::lifetime_end:
+        case Intrinsic::lifetime_start:
+          scalarizeInstruction(it);
+          break;
+        default:
+          for (unsigned Part = 0; Part < UF; ++Part) {
+            SmallVector<Value *, 4> Args;
+            for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
+              VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
+              Args.push_back(Arg[Part]);
+            }
+            Type *Tys[] = {CI->getType()};
+            if (VF > 1)
+              Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
+
+            Function *F = Intrinsic::getDeclaration(M, ID, Tys);
+            Entry[Part] = Builder.CreateCall(F, Args);
+          }
+          break;
+        }
+      } else if (TLI && TLI->isFunctionVectorizable(FnName, VF)) {
+        // This is a function with a vector form.
+        StringRef VFnName = TLI->getVectorizedFunction(FnName, VF);
+        assert(!VFnName.empty());
+
+        Function *VectorF = M->getFunction(VFnName);
+        if (!VectorF) {
+          // Generate a declaration
+          FunctionType *FTy = FunctionType::get(RetTy, Tys, false);
+          VectorF = Function::Create(FTy, Function::ExternalLinkage, VFnName, M);
+          assert(VectorF);
+        }
+
         for (unsigned Part = 0; Part < UF; ++Part) {
           SmallVector<Value *, 4> Args;
           for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
             VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
             Args.push_back(Arg[Part]);
           }
-          Type *Tys[] = {CI->getType()};
-          if (VF > 1)
-            Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
 
-          Function *F = Intrinsic::getDeclaration(M, ID, Tys);
-          Entry[Part] = Builder.CreateCall(F, Args);
+          Entry[Part] = Builder.CreateCall(VectorF, Args);
         }
-        break;
+      } else {
+        // We have a function call that has no vector form - we must scalarize
+        // it.
+        // FIXME: We could check if it has a vector form for smaller values of
+        // VF, then chain them together instead of bailing and being fully
+        // scalar.
+        bool IsVoidTy = CI->getType()->isVoidTy();
+
+        for (unsigned UPart = 0; UPart < UF; ++UPart) {
+          Value *VRet = NULL;
+          // If we have to return something, start with an undefined vector and
+          // fill it in element by element.
+          if (!IsVoidTy)
+            VRet = UndefValue::get(VectorType::get(CI->getType(), VF));
+
+          for (unsigned VPart = 0; VPart < VF; ++VPart) {
+            
+            SmallVector<Value *, 4> Args;
+            for (unsigned I = 0, IE = CI->getNumArgOperands(); I != IE; ++I) {
+              Value *Operand = CI->getArgOperand(I);
+
+              Instruction *Inst = dyn_cast<Instruction>(Operand);
+              if (!Inst || Legal->isUniformAfterVectorization(Inst)) {
+                // Uniform variable - just use the original scalar argument.
+                Args.push_back(Operand);
+              } else {
+                // Non-uniform.
+                assert(WidenMap.has(Operand) &&
+                       "Non-uniform values must be in WidenMap!");
+                Value *VArg = WidenMap.get(Operand)[UPart];
+                Value *Arg =
+                  Builder.CreateExtractElement(VArg,
+                                               Builder.getInt32(VPart));
+                Args.push_back(Arg);
+              }
+            }
+            
+            Value *NewCI = Builder.CreateCall(CI->getCalledFunction(), Args);
+
+            if (!IsVoidTy)
+              VRet = Builder.CreateInsertElement(VRet, NewCI,
+                                                 Builder.getInt32(VPart));
+          }
+          Entry[UPart] = VRet;
+        }
+
       }
       break;
     }
@@ -3468,11 +3553,16 @@
         return false;
       }// end of PHI handling
 
-      // We still don't handle functions. However, we can ignore dbg intrinsic
-      // calls and we do handle certain intrinsic and libm functions.
+      // We handle calls that:
+      //   * Are debug info intrinsics.
+      //   * Have a mapping to an IR intrinsic.
+      //   * Have a vector version available.
+
       CallInst *CI = dyn_cast<CallInst>(it);
-      if (CI && !getIntrinsicIDForCall(CI, TLI) && !isa<DbgInfoIntrinsic>(CI)) {
-        DEBUG(dbgs() << "LV: Found a call site.\n");
+      if (CI && !getIntrinsicIDForCall(CI, TLI) && !isa<DbgInfoIntrinsic>(CI)
+          && !(CI->getCalledFunction() && TLI &&
+               TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) {
+        DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n");
         return false;
       }
 
@@ -4405,6 +4495,12 @@
         if (Call && getIntrinsicIDForCall(Call, TLI))
           continue;
 
+        // If the function has an explicit vectorized counterpart, we can safely
+        // assume that it can be vectorized.
+        if (Call && Call->getCalledFunction() && TLI &&
+            TLI->isFunctionVectorizable(Call->getCalledFunction()->getName()))
+          continue;
+
         LoadInst *Ld = dyn_cast<LoadInst>(it);
         if (!Ld) return false;
         if (!Ld->isSimple() && !IsAnnotatedParallel) {
@@ -5609,13 +5705,16 @@
   }
   case Instruction::Call: {
     CallInst *CI = cast<CallInst>(I);
-    Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI);
-    assert(ID && "Not an intrinsic call!");
+    Function *F = CI->getCalledFunction();
     Type *RetTy = ToVectorTy(CI->getType(), VF);
     SmallVector<Type*, 4> Tys;
     for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
       Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
-    return TTI.getIntrinsicInstrCost(ID, RetTy, Tys);
+
+    unsigned Cost = TTI.getCallInstrCost(F, RetTy, Tys);
+    if (Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI))
+      return std::min(Cost, TTI.getIntrinsicInstrCost(ID, RetTy, Tys));
+    return Cost;
   }
   default: {
     // We are scalarizing the instruction. Return the cost of the scalar
@@ -5642,12 +5741,6 @@
   }// end of switch.
 }
 
-Type* LoopVectorizationCostModel::ToVectorTy(Type *Scalar, unsigned VF) {
-  if (Scalar->isVoidTy() || VF == 1)
-    return Scalar;
-  return VectorType::get(Scalar, VF);
-}
-
 char LoopVectorize::ID = 0;
 static const char lv_name[] = "Loop Vectorization";
 INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)
Index: lib/Analysis/TargetTransformInfo.cpp
===================================================================
--- lib/Analysis/TargetTransformInfo.cpp	(revision 204319)
+++ lib/Analysis/TargetTransformInfo.cpp	(working copy)
@@ -215,6 +215,13 @@
   return PrevTTI->getIntrinsicInstrCost(ID, RetTy, Tys);
 }
 
+unsigned
+TargetTransformInfo::getCallInstrCost(Function *F,
+                                      Type *RetTy,
+                                      ArrayRef<Type *> Tys) const {
+  return PrevTTI->getCallInstrCost(F, RetTy, Tys);
+}
+
 unsigned TargetTransformInfo::getNumberOfParts(Type *Tp) const {
   return PrevTTI->getNumberOfParts(Tp);
 }
@@ -600,6 +607,12 @@
     return 1;
   }
 
+  unsigned getCallInstrCost(Function *F,
+                            Type *RetTy,
+                            ArrayRef<Type*> Tys) const override {
+    return 10;
+  }
+
   unsigned getNumberOfParts(Type *Tp) const override {
     return 0;
   }
Index: lib/Target/TargetLibraryInfo.cpp
===================================================================
--- lib/Target/TargetLibraryInfo.cpp	(revision 204319)
+++ lib/Target/TargetLibraryInfo.cpp	(working copy)
@@ -663,6 +663,17 @@
     TLI.setUnavailable(LibFunc::statvfs64);
     TLI.setUnavailable(LibFunc::tmpfile64);
   }
+
+  // The Accelerate library adds vectorizable variants of many
+  // standard library functions.
+  // FIXME: Make the following list complete.
+  if (T.getEnvironmentName() == "Accelerate") {
+    const TargetLibraryInfo::VecDesc VecFuncs[] = {
+      {"expf", "vexpf", 4},
+      {"llvm.exp.f32", "vexpf", 4}
+    };
+    TLI.addVectorizableFunctions(VecFuncs);
+  }
 }
 
 
@@ -686,23 +697,17 @@
   CustomNames = TLI.CustomNames;
 }
 
-namespace {
-struct StringComparator {
-  /// Compare two strings and return true if LHS is lexicographically less than
-  /// RHS. Requires that RHS doesn't contain any zero bytes.
-  bool operator()(const char *LHS, StringRef RHS) const {
-    // Compare prefixes with strncmp. If prefixes match we know that LHS is
-    // greater or equal to RHS as RHS can't contain any '\0'.
-    return std::strncmp(LHS, RHS.data(), RHS.size()) < 0;
-  }
+static StringRef sanitizeFunctionName(StringRef funcName) {
+  // Filter out empty names and names containing null bytes, those can't be in
+  // our table.
+  if (funcName.empty() || funcName.find('\0') != StringRef::npos)
+    return StringRef();
 
-  // Provided for compatibility with MSVC's debug mode.
-  bool operator()(StringRef LHS, const char *RHS) const { return LHS < RHS; }
-  bool operator()(StringRef LHS, StringRef RHS) const { return LHS < RHS; }
-  bool operator()(const char *LHS, const char *RHS) const {
-    return std::strcmp(LHS, RHS) < 0;
-  }
-};
+  // Check for \01 prefix that is used to mangle __asm declarations and
+  // strip it if present.
+  if (funcName.front() == '\01')
+    funcName = funcName.substr(1);
+  return funcName;
 }
 
 bool TargetLibraryInfo::getLibFunc(StringRef funcName,
@@ -710,16 +715,13 @@
   const char **Start = &StandardNames[0];
   const char **End = &StandardNames[LibFunc::NumLibFuncs];
 
-  // Filter out empty names and names containing null bytes, those can't be in
-  // our table.
-  if (funcName.empty() || funcName.find('\0') != StringRef::npos)
+  funcName = sanitizeFunctionName(funcName);
+  if (funcName.empty())
     return false;
 
-  // Check for \01 prefix that is used to mangle __asm declarations and
-  // strip it if present.
-  if (funcName.front() == '\01')
-    funcName = funcName.substr(1);
-  const char **I = std::lower_bound(Start, End, funcName, StringComparator());
+  const char **I = std::lower_bound(Start, End, funcName, [](const char *LHS, StringRef RHS) {
+      return std::strncmp(LHS, RHS.data(), RHS.size()) < 0;
+    });
   if (I != End && *I == funcName) {
     F = (LibFunc::Func)(I - Start);
     return true;
@@ -732,3 +734,77 @@
 void TargetLibraryInfo::disableAllFunctions() {
   memset(AvailableArray, 0, sizeof(AvailableArray));
 }
+
+void TargetLibraryInfo::addVectorizableFunctions(ArrayRef<VecDesc> Fns) {
+  VectorDescs.insert(VectorDescs.end(), Fns.begin(), Fns.end());
+  std::sort(VectorDescs.begin(), VectorDescs.end(),
+            [](const VecDesc &LHS, const VecDesc &RHS) {
+              return std::strcmp(LHS.ScalarFnName,
+                                 RHS.ScalarFnName) < 0;
+            });
+
+  ScalarDescs.insert(ScalarDescs.end(), Fns.begin(), Fns.end());
+  std::sort(ScalarDescs.begin(), ScalarDescs.end(),
+            [](const VecDesc &LHS, const VecDesc &RHS) {
+              return std::strcmp(LHS.VectorFnName,
+                                 RHS.VectorFnName) < 0;
+            });
+}
+
+bool TargetLibraryInfo::isFunctionVectorizable(StringRef funcName) const {
+  funcName = sanitizeFunctionName(funcName);
+  if (funcName.empty())
+    return false;
+
+  std::vector<VecDesc>::const_iterator I =
+    std::lower_bound(VectorDescs.begin(),
+                     VectorDescs.end(),
+                     funcName,
+                     [](const VecDesc &LHS, StringRef S) {
+                       return std::strncmp(LHS.ScalarFnName, S.data(),
+                                           S.size()) < 0;
+                     });
+  return I != VectorDescs.end() && StringRef(I->ScalarFnName) == funcName;
+}
+
+StringRef TargetLibraryInfo::getVectorizedFunction(StringRef F,
+                                                   unsigned VF) const {
+  F = sanitizeFunctionName(F);
+  if (F.empty())
+    return F;
+
+  std::vector<VecDesc>::const_iterator I =
+    std::lower_bound(VectorDescs.begin(),
+                     VectorDescs.end(),
+                     F,
+                     [](const VecDesc &LHS, StringRef S) {
+                       return std::strncmp(LHS.ScalarFnName, S.data(),
+                                           S.size()) < 0;
+                     });
+  while (I != VectorDescs.end() && StringRef(I->ScalarFnName) == F) {
+    if (I->VectorizationFactor == VF)
+      return I->VectorFnName;
+    ++I;
+  }
+  return StringRef();
+}
+
+StringRef TargetLibraryInfo::getScalarizedFunction(StringRef F,
+                                                   unsigned &VF) const {
+  F = sanitizeFunctionName(F);
+  if (F.empty())
+    return F;
+
+  std::vector<VecDesc>::const_iterator I =
+    std::lower_bound(ScalarDescs.begin(),
+                     ScalarDescs.end(),
+                     F,
+                     [](const VecDesc &LHS, StringRef S) {
+                       return std::strncmp(LHS.VectorFnName, S.data(),
+                                           S.size()) < 0;
+                     });
+  if (I == ScalarDescs.end() || StringRef(I->VectorFnName) != F)
+    return StringRef();
+  VF = I->VectorizationFactor;
+  return I->ScalarFnName;
+}
Index: lib/CodeGen/BasicTargetTransformInfo.cpp
===================================================================
--- lib/CodeGen/BasicTargetTransformInfo.cpp	(revision 204319)
+++ lib/CodeGen/BasicTargetTransformInfo.cpp	(working copy)
@@ -16,8 +16,10 @@
 //===----------------------------------------------------------------------===//
 
 #define DEBUG_TYPE "basictti"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/CodeGen/Passes.h"
-#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Function.h"
+#include "llvm/Target/TargetLibraryInfo.h"
 #include "llvm/Target/TargetLowering.h"
 #include <utility>
 using namespace llvm;
@@ -105,6 +107,8 @@
                            unsigned AddressSpace) const override;
   unsigned getIntrinsicInstrCost(Intrinsic::ID, Type *RetTy,
                                  ArrayRef<Type*> Tys) const override;
+  unsigned getCallInstrCost(Function *F, Type *RetTy,
+                            ArrayRef<Type*> Tys) const override;
   unsigned getNumberOfParts(Type *Tp) const override;
   unsigned getAddressComputationCost( Type *Ty, bool IsComplex) const override;
   unsigned getReductionCost(unsigned Opcode, Type *Ty,
@@ -434,7 +438,7 @@
     for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
       if (Tys[i]->isVectorTy()) {
         ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
-        ScalarCalls = std::max(ScalarCalls, RetTy->getVectorNumElements());
+        ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
       }
     }
 
@@ -493,13 +497,48 @@
     unsigned Num = RetTy->getVectorNumElements();
     unsigned Cost = TopTTI->getIntrinsicInstrCost(IID, RetTy->getScalarType(),
                                                   Tys);
-    return 10 * Cost * Num;
+    unsigned ScalarizationCost = 0;
+    if (RetTy->isVectorTy())
+      ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
+    for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
+      if (Tys[i]->isVectorTy())
+        ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
+    }
+
+    return Cost * Num + ScalarizationCost;
   }
 
   // This is going to be turned into a library call, make it expensive.
   return 10;
 }
 
+unsigned BasicTTI::getCallInstrCost(Function *F, Type *RetTy,
+                                    ArrayRef<Type *> Tys) const {
+
+  // Scalar function calls are always expensive.
+  if (!RetTy->isVectorTy())
+    return 10;
+
+  const TargetLibraryInfo *TLI = getAnalysisIfAvailable<TargetLibraryInfo>();
+
+  // Functions with a vector form are no more expensive than a scalar call.
+  if (TLI && TLI->isFunctionVectorizable(F->getName(),
+                                         RetTy->getVectorNumElements()))
+    return 10;
+    
+  // We have to scalarize this function call. Estimate the cost.
+  unsigned ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
+  unsigned ScalarCalls = RetTy->getVectorNumElements();
+  for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
+    if (Tys[i]->isVectorTy()) {
+      ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
+      ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
+    }
+  }
+
+  return ScalarCalls * 10 + ScalarizationCost;
+}
+
 unsigned BasicTTI::getNumberOfParts(Type *Tp) const {
   std::pair<unsigned, MVT> LT = getTLI()->getTypeLegalizationCost(Tp);
   return LT.first;

