RFC: Enable vectorization of call instructions in the loop vectorizer
James Molloy
james at jamesmolloy.co.uk
Mon Mar 17 07:38:20 PDT 2014
Hi Arnold,
Sorry for the long delay on this - I've been working on it in my spare
time and haven't had much of that lately! :)
This version of the patch:
* Addresses your three points in your previous email.
* Adds support for the Accelerate library, though only for one function in
it (expf), for testing purposes. There is a FIXME for someone with more
Apple knowledge and ability to test than me to fill in the rest.
* Rebases to ToT and updates TargetLibraryInfo to use C++11 lambdas with
std::lower_bound rather than functors (see the sketch below).
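For reference, this is the shape of the lower_bound change - the comparator
functor replaced by a lambda. The table and helper here are illustrative
only; the real code operates on TargetLibraryInfo::StandardNames:

  #include <algorithm>
  #include <cstring>
  #include <iterator>
  #include "llvm/ADT/StringRef.h"

  // A sorted table of C strings, standing in for StandardNames.
  static const char *Names[] = {"cosf", "expf", "sinf"};

  static const char **lookupName(llvm::StringRef Name) {
    return std::lower_bound(std::begin(Names), std::end(Names), Name,
                            [](const char *LHS, llvm::StringRef RHS) {
      // RHS can't contain '\0', so comparing prefixes is sufficient.
      return std::strncmp(LHS, RHS.data(), RHS.size()) < 0;
    });
  }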
Does it look better?
Cheers,
James
On 17 January 2014 17:22, James Molloy <james at jamesmolloy.co.uk> wrote:
> Awesome, thanks Arnold! Very clear now.
>
>
> On 17 January 2014 16:45, Arnold Schwaighofer <aschwaighofer at apple.com>wrote:
>
>>
>> On Jan 17, 2014, at 2:59 AM, James Molloy <james at jamesmolloy.co.uk>
>> wrote:
>>
>> > Hi Arnold,
>> >
>> > > First, we are going to have the situation where there exists an
>> intrinsic ID for a library function (many math library functions have an
>> intrinsic version: expf -> llvm.exp.f32, for example). As a consequence,
>> "getIntrinsicIDForCall" will return it. In this case we can have both: a
>> vectorized library function version and an intrinsic version that may be
>> slower or faster. In such a case the cost model has to decide which one to
>> pick. This means we have to query the cost model in two places: when we
>> get the instruction cost and when we vectorize the call.
>> >
>> > Sure, I will address this.
>> >
>> > > Second, the way we test this. [snip]
>> >
>> > This is very sensible. The only reason I didn't go down this route to
>> start with was that I didn't know of an available library (like Accelerate)
>> and didn't want to add testing/dummy code in tree. Thanks for pointing me
>> at Accelerate - that'll give me a real library to (semi) implement and test.
>> >
>> > > This brings me to issue three. You are currently using
>> TTI->getCallCost() which is not meant to be used with the vectorizer. We
>> should create a getCallInstrCost() function similar to the
>> "getIntrinsicInstrCost" function we already have.
>> > >
>> > > BasicTTI::getCallInstrCost should query TLI->isFunctionVectorizable()
>> and return a sensible value in this case (one that is lower than a
>> scalarized intrinsic lowered as lib call).
>> >
>> > I don't understand the difference between getIntrinsicCost and
>> getIntrinsicInstrCost. They both take the same arguments (but return
>> different values), and the doxygen docstring does not describe the action
>> in enough detail to discern what the required behaviour is.
>> >
>> > Could you please tell me? (and I'll update the docstrings while I'm at
>> it).
>>
>> Sure, TargetTransformInfo is split into two "cost" metrics:
>>
>> * Generic target information which returns its cost in terms of
>> "TargetCostConstants":
>>
>> /// \name Generic Target Information
>> /// @{
>>
>> /// \brief Underlying constants for 'cost' values in this interface.
>> ///
>> /// Many APIs in this interface return a cost. This enum defines the
>> /// fundamental values that should be used to interpret (and produce)
>> /// those costs. The costs are returned as an unsigned rather than a
>> /// member of this enumeration because it is expected that the cost of
>> /// one IR instruction may have a multiplicative factor to it or
>> /// otherwise won't fit directly into the enum. Moreover, it is common
>> /// to sum or average costs which works better as simple integral
>> /// values. Thus this enum only provides constants.
>> ...
>> /// @}
>>
>> This API is used by the inliner (getUserCost) to estimate the cost (size)
>> of instructions.
>>
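The constants Arnold is referring to look roughly like this (from
TargetTransformInfo.h at the time):

  enum TargetCostConstants {
    TCC_Free = 0,      ///< Expected to fold away completely.
    TCC_Basic = 1,     ///< The cost of a typical 'add' instruction.
    TCC_Expensive = 4  ///< The cost of a division-like operation.
  };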
>> * Throughput estimate for the vectorizer. This API attempts to estimate
>> (very crudely, on an instruction-by-instruction basis) the throughput of
>> instructions (since we automatically infer most values using
>> TargetLoweringInfo, and we have to do this from IR, this is not going to
>> be very accurate ...).
>>
>> /// \name Vector Target Information
>> /// @{
>> ...
>> /// \return The expected cost of arithmetic ops, such as mul, xor, fsub,
>> /// etc.
>> virtual unsigned getArithmeticInstrCost(unsigned Opcode, Type *Ty,
>> ...
>> /// @}
>>
>> At a high level, this API tries to answer the question: what does this
>> instruction cost in scalar form ("expf", f32)? Or, what does this
>> instruction cost in vectorized form ("expf", <4 x float>)?
>>
>> BasicTTI::getIntrinsicInstrCost() assumes a cost of 1 for intrinsics that
>> have a corresponding ISA instruction
>> (TLoweringI->isOperationLegalOrPromote(ISD::FEXP) returns true), a cost
>> of 10 for the ones that don't, and then we also incorporate things like
>> type legalization costs, and overhead if we vectorize.
>>
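In other words, the behaviour Arnold describes is roughly this - an
illustrative sketch only, not the actual BasicTTI code; type legalization
is omitted, and the real overhead values come from the
scalarization-overhead helpers:

  // Illustrative: the shape of BasicTTI-style intrinsic costing.
  unsigned sketchIntrinsicCost(bool HasLegalForm, unsigned VF,
                               unsigned ScalarizationOverhead) {
    if (HasLegalForm)  // e.g. isOperationLegalOrPromote(ISD::FEXP)
      return 1;        // maps directly to an ISA instruction
    if (VF <= 1)
      return 10;       // lowered as a single lib call
    // No legal vector form: one lib call per lane, plus extract/insert
    // overhead for the operands and result.
    return 10 * VF + ScalarizationOverhead;
  }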
>> For the new BasicTTI::getCallInstrCost(Function, RetTy, ArgTys) we would
>> also return 10 for scalar versions of the function (RetTy->isVectorTy()
>> == false).
>> For vector queries (RetTy->isVectorTy() == true), if there is a
>> TLibInfo->isFunctionVectorizable(Function->getName(),
>> RetTy->getVectorNumElements()) match we should also return 10. Otherwise,
>> we estimate the cost of scalarization just like we do in
>> getIntrinsicInstrCost. This will guarantee that the vectorized library
>> function call (Cost = 10) will be chosen over the intrinsic lowered to a
>> sequence of scalarized lib calls (Cost = 10 * VF * ...).
>>
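Worked through for expf at VF = 4: the vexpf library call costs 10, while
the scalarized intrinsic costs at least 10 * 4 = 40 before extract/insert
overhead is counted, so the std::min below picks the library call.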
>> Then, in LoopVectorizationCostModel::getInstructionCost() you would query
>> both APIs (if getIntrinsicIDForCall returns an ID) and return the smaller:
>>
>> case Call:
>>   CallInst *CI = cast<CallInst>(I);
>>
>>   Type *RetTy = ToVectorTy(CI->getType(), VF);
>>   SmallVector<Type*, 4> Tys;
>>   for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
>>     Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
>>
>>   unsigned LibFuncCallCost =
>>     TTI.getCallInstrCost(CI->getCalledFunction(), RetTy, Tys);
>>
>>   if (unsigned ID = getIntrinsicIDForCall(CI, TLI))
>>     return std::min(LibFuncCallCost,
>>                     TTI.getIntrinsicInstrCost(ID, RetTy, Tys));
>>   return LibFuncCallCost;
>>
>>
>> Thanks,
>> Arnold
>>
>>
>
>
>
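For orientation before the patch itself, a sketch of how the new
TargetLibraryInfo hooks are intended to be used. The names come from the
patch below; the surrounding code is illustrative only:

  // Mappings registered for the Accelerate environment below:
  //   {"expf", "vexpf", 4} and {"llvm.exp.f32", "vexpf", 4}.
  if (TLI->isFunctionVectorizable("expf", /*VF=*/4)) {
    StringRef VFnName = TLI->getVectorizedFunction("expf", 4); // "vexpf"
    // ... declare and call the vector function, as the LoopVectorize
    // changes do.
  }

  // And the reverse query, from vector name back to scalar name plus VF:
  unsigned VF;
  StringRef SFnName = TLI->getScalarizedFunction("vexpf", VF);
  // SFnName == "expf", VF == 4.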
-------------- next part --------------
Index: include/llvm/Analysis/TargetTransformInfo.h
===================================================================
--- include/llvm/Analysis/TargetTransformInfo.h (revision 204039)
+++ include/llvm/Analysis/TargetTransformInfo.h (working copy)
@@ -389,6 +389,10 @@
virtual unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
ArrayRef<Type *> Tys) const;
+ /// \returns The cost of Call instructions.
+ virtual unsigned getCallInstrCost(Function *F, Type *RetTy,
+ ArrayRef<Type *> Tys) const;
+
/// \returns The number of pieces into which the provided type must be
/// split during legalization. Zero is returned when the answer is unknown.
virtual unsigned getNumberOfParts(Type *Tp) const;
Index: include/llvm/Target/TargetLibraryInfo.h
===================================================================
--- include/llvm/Target/TargetLibraryInfo.h (revision 204039)
+++ include/llvm/Target/TargetLibraryInfo.h (working copy)
@@ -11,6 +11,7 @@
#define LLVM_TARGET_TARGETLIBRARYINFO_H
#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/ArrayRef.h"
#include "llvm/Pass.h"
namespace llvm {
@@ -673,10 +674,26 @@
/// library functions are available for the current target, and allows a
/// frontend to disable optimizations through -fno-builtin etc.
class TargetLibraryInfo : public ImmutablePass {
+public:
+ /// VecDesc - Describes a possible vectorization of a function.
+ /// Function 'VectorFnName' is equivalent to 'ScalarFnName' vectorized
+ /// by a factor 'VectorizationFactor'.
+ struct VecDesc {
+ const char *ScalarFnName;
+ const char *VectorFnName;
+ unsigned VectorizationFactor;
+ };
+
+private:
virtual void anchor();
unsigned char AvailableArray[(LibFunc::NumLibFuncs+3)/4];
llvm::DenseMap<unsigned, std::string> CustomNames;
static const char* StandardNames[LibFunc::NumLibFuncs];
+ /// Vectorization descriptors - sorted by ScalarFnName.
+ std::vector<VecDesc> VectorDescs;
+ /// Scalarization descriptors - same content as VectorDescs but sorted based
+ /// on VectorFnName rather than ScalarFnName.
+ std::vector<VecDesc> ScalarDescs;
enum AvailabilityState {
StandardName = 3, // (memset to all ones)
@@ -772,6 +789,38 @@
/// disableAllFunctions - This disables all builtins, which is used for
/// options like -fno-builtin.
void disableAllFunctions();
+
+ /// addVectorizableFunctions - Add a set of scalar -> vector mappings,
+ /// queryable via getVectorizedFunction and getScalarizedFunction.
+ void addVectorizableFunctions(ArrayRef<VecDesc> Fns);
+
+ /// isFunctionVectorizable - Return true if the function F has a
+ /// vector equivalent with vectorization factor VF.
+ bool isFunctionVectorizable(StringRef F, unsigned VF) const {
+ return !getVectorizedFunction(F, VF).empty();
+ }
+
+ /// isFunctionVectorizable - Return true if the function F has a
+ /// vector equivalent with any vectorization factor.
+ bool isFunctionVectorizable(StringRef F) const;
+
+ /// getVectorizedFunction - Return the name of the equivalent of
+ /// F, vectorized with factor VF. If no such mapping exists,
+ /// return the empty string.
+ StringRef getVectorizedFunction(StringRef F, unsigned VF) const;
+
+ /// isFunctionScalarizable - Return true if the function F has a
+ /// scalar equivalent, and set VF to be the vectorization factor.
+ bool isFunctionScalarizable(StringRef F, unsigned &VF) const {
+ return !getScalarizedFunction(F, VF).empty();
+ }
+
+ /// getScalarizedFunction - Return the name of the equivalent of
+ /// F, scalarized. If no such mapping exists, return the empty string.
+ ///
+ /// Set VF to the vectorization factor.
+ StringRef getScalarizedFunction(StringRef F, unsigned &VF) const;
+
};
} // end namespace llvm
Index: test/Transforms/LoopVectorize/funcall.ll
===================================================================
--- test/Transforms/LoopVectorize/funcall.ll (revision 204039)
+++ test/Transforms/LoopVectorize/funcall.ll (working copy)
@@ -1,6 +1,7 @@
; RUN: opt -S -loop-vectorize -force-vector-width=2 -force-vector-unroll=1 < %s | FileCheck %s
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
+target triple = "armv7-apple-macos-Accelerate"
; Make sure we can vectorize loops with functions to math library functions.
; They might read the rounding mode but we are only vectorizing loops that
Index: test/Transforms/LoopVectorize/libcall.ll
===================================================================
--- test/Transforms/LoopVectorize/libcall.ll (revision 0)
+++ test/Transforms/LoopVectorize/libcall.ll (working copy)
@@ -0,0 +1,55 @@
+; RUN: opt < %s -loop-vectorize -S | FileCheck %s
+
+target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
+target triple = "armv7-apple-macos-Accelerate"
+
+;CHECK-LABEL: @exp_intrinsic_f32(
+;CHECK: vexp
+;CHECK: ret void
+define void @exp_intrinsic_f32(i32 %n, float* noalias %y, float* noalias %x) nounwind uwtable {
+entry:
+ %cmp6 = icmp sgt i32 %n, 0
+ br i1 %cmp6, label %for.body, label %for.end
+
+for.body: ; preds = %entry, %for.body
+ %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
+ %arrayidx = getelementptr inbounds float* %y, i64 %indvars.iv
+ %0 = load float* %arrayidx, align 4
+ %call = tail call float @llvm.exp.f32(float %0) nounwind readnone
+ %arrayidx2 = getelementptr inbounds float* %x, i64 %indvars.iv
+ store float %call, float* %arrayidx2, align 4
+ %indvars.iv.next = add i64 %indvars.iv, 1
+ %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+ %exitcond = icmp eq i32 %lftr.wideiv, %n
+ br i1 %exitcond, label %for.end, label %for.body
+
+for.end: ; preds = %for.body, %entry
+ ret void
+}
+
+;CHECK-LABEL: @exp_libcall_f32(
+;CHECK: vexp
+;CHECK: ret void
+define void @exp_libcall_f32(i32 %n, float* noalias %y, float* noalias %x) nounwind uwtable {
+entry:
+ %cmp6 = icmp sgt i32 %n, 0
+ br i1 %cmp6, label %for.body, label %for.end
+
+for.body: ; preds = %entry, %for.body
+ %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
+ %arrayidx = getelementptr inbounds float* %y, i64 %indvars.iv
+ %0 = load float* %arrayidx, align 4
+ %call = tail call float @expf(float %0) nounwind readnone
+ %arrayidx2 = getelementptr inbounds float* %x, i64 %indvars.iv
+ store float %call, float* %arrayidx2, align 4
+ %indvars.iv.next = add i64 %indvars.iv, 1
+ %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+ %exitcond = icmp eq i32 %lftr.wideiv, %n
+ br i1 %exitcond, label %for.end, label %for.body
+
+for.end: ; preds = %for.body, %entry
+ ret void
+}
+
+declare float @llvm.exp.f32(float) nounwind readnone
+declare float @expf(float) nounwind readnone
\ No newline at end of file
Index: test/Analysis/CostModel/X86/intrinsic-cost.ll
===================================================================
--- test/Analysis/CostModel/X86/intrinsic-cost.ll (revision 204039)
+++ test/Analysis/CostModel/X86/intrinsic-cost.ll (working copy)
@@ -22,7 +22,7 @@
ret void
; CORE2: Printing analysis 'Cost Model Analysis' for function 'test1':
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction: %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 40 for instruction: %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
; COREI7: Printing analysis 'Cost Model Analysis' for function 'test1':
; COREI7: Cost Model: Found an estimated cost of 1 for instruction: %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
@@ -50,7 +50,7 @@
ret void
; CORE2: Printing analysis 'Cost Model Analysis' for function 'test2':
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction: %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 40 for instruction: %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
; COREI7: Printing analysis 'Cost Model Analysis' for function 'test2':
; COREI7: Cost Model: Found an estimated cost of 1 for instruction: %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
Index: lib/Analysis/TargetTransformInfo.cpp
===================================================================
--- lib/Analysis/TargetTransformInfo.cpp (revision 204039)
+++ lib/Analysis/TargetTransformInfo.cpp (working copy)
@@ -215,6 +215,13 @@
return PrevTTI->getIntrinsicInstrCost(ID, RetTy, Tys);
}
+unsigned
+TargetTransformInfo::getCallInstrCost(Function *F,
+ Type *RetTy,
+ ArrayRef<Type *> Tys) const {
+ return PrevTTI->getCallInstrCost(F, RetTy, Tys);
+}
+
unsigned TargetTransformInfo::getNumberOfParts(Type *Tp) const {
return PrevTTI->getNumberOfParts(Tp);
}
@@ -600,6 +607,12 @@
return 1;
}
+ unsigned getCallInstrCost(Function *F,
+ Type *RetTy,
+ ArrayRef<Type*> Tys) const override {
+ return 10;
+ }
+
unsigned getNumberOfParts(Type *Tp) const override {
return 0;
}
Index: lib/Target/TargetLibraryInfo.cpp
===================================================================
--- lib/Target/TargetLibraryInfo.cpp (revision 204039)
+++ lib/Target/TargetLibraryInfo.cpp (working copy)
@@ -663,6 +663,17 @@
TLI.setUnavailable(LibFunc::statvfs64);
TLI.setUnavailable(LibFunc::tmpfile64);
}
+
+ // The Accelerate library adds vectorizable variants of many
+ // standard library functions.
+ // FIXME: Make the following list complete.
+ if (T.getEnvironmentName() == "Accelerate") {
+ const TargetLibraryInfo::VecDesc VecFuncs[] = {
+ {"expf", "vexpf", 4},
+ {"llvm.exp.f32", "vexpf", 4}
+ };
+ TLI.addVectorizableFunctions(VecFuncs);
+ }
}
@@ -686,23 +697,17 @@
CustomNames = TLI.CustomNames;
}
-namespace {
-struct StringComparator {
- /// Compare two strings and return true if LHS is lexicographically less than
- /// RHS. Requires that RHS doesn't contain any zero bytes.
- bool operator()(const char *LHS, StringRef RHS) const {
- // Compare prefixes with strncmp. If prefixes match we know that LHS is
- // greater or equal to RHS as RHS can't contain any '\0'.
- return std::strncmp(LHS, RHS.data(), RHS.size()) < 0;
- }
+static StringRef sanitizeFunctionName(StringRef funcName) {
+ // Filter out empty names and names containing null bytes, those can't be in
+ // our table.
+ if (funcName.empty() || funcName.find('\0') != StringRef::npos)
+ return StringRef();
- // Provided for compatibility with MSVC's debug mode.
- bool operator()(StringRef LHS, const char *RHS) const { return LHS < RHS; }
- bool operator()(StringRef LHS, StringRef RHS) const { return LHS < RHS; }
- bool operator()(const char *LHS, const char *RHS) const {
- return std::strcmp(LHS, RHS) < 0;
- }
-};
+ // Check for \01 prefix that is used to mangle __asm declarations and
+ // strip it if present.
+ if (funcName.front() == '\01')
+ funcName = funcName.substr(1);
+ return funcName;
}
bool TargetLibraryInfo::getLibFunc(StringRef funcName,
@@ -710,16 +715,13 @@
const char **Start = &StandardNames[0];
const char **End = &StandardNames[LibFunc::NumLibFuncs];
- // Filter out empty names and names containing null bytes, those can't be in
- // our table.
- if (funcName.empty() || funcName.find('\0') != StringRef::npos)
+ funcName = sanitizeFunctionName(funcName);
+ if (funcName.empty())
return false;
- // Check for \01 prefix that is used to mangle __asm declarations and
- // strip it if present.
- if (funcName.front() == '\01')
- funcName = funcName.substr(1);
- const char **I = std::lower_bound(Start, End, funcName, StringComparator());
+ // Compare prefixes with strncmp; funcName can't contain a '\0', so if
+ // the prefixes match, LHS is lexicographically >= funcName.
+ const char **I = std::lower_bound(Start, End, funcName,
+ [](const char *LHS, StringRef RHS) {
+ return std::strncmp(LHS, RHS.data(), RHS.size()) < 0;
+ });
if (I != End && *I == funcName) {
F = (LibFunc::Func)(I - Start);
return true;
@@ -732,3 +734,77 @@
void TargetLibraryInfo::disableAllFunctions() {
memset(AvailableArray, 0, sizeof(AvailableArray));
}
+
+void TargetLibraryInfo::addVectorizableFunctions(ArrayRef<VecDesc> Fns) {
+ VectorDescs.insert(VectorDescs.end(), Fns.begin(), Fns.end());
+ std::sort(VectorDescs.begin(), VectorDescs.end(),
+ [](const VecDesc &LHS, const VecDesc &RHS) {
+ // Use strcmp so prefixes like "exp" vs "expf" order consistently.
+ return std::strcmp(LHS.ScalarFnName, RHS.ScalarFnName) < 0;
+ });
+
+ ScalarDescs.insert(ScalarDescs.end(), Fns.begin(), Fns.end());
+ std::sort(ScalarDescs.begin(), ScalarDescs.end(),
+ [](const VecDesc &LHS, const VecDesc &RHS) {
+ return std::strcmp(LHS.VectorFnName, RHS.VectorFnName) < 0;
+ });
+}
+
+bool TargetLibraryInfo::isFunctionVectorizable(StringRef funcName) const {
+ funcName = sanitizeFunctionName(funcName);
+ if (funcName.empty())
+ return false;
+
+ std::vector<VecDesc>::const_iterator I =
+ std::lower_bound(VectorDescs.begin(),
+ VectorDescs.end(),
+ funcName,
+ [](const VecDesc &LHS, StringRef S) {
+ return std::strncmp(LHS.ScalarFnName, S.data(),
+ S.size()) < 0;
+ });
+ return I != VectorDescs.end() && StringRef(I->ScalarFnName) == funcName;
+}
+
+StringRef TargetLibraryInfo::getVectorizedFunction(StringRef F,
+ unsigned VF) const {
+ F = sanitizeFunctionName(F);
+ if (F.empty())
+ return F;
+
+ std::vector<VecDesc>::const_iterator I =
+ std::lower_bound(VectorDescs.begin(),
+ VectorDescs.end(),
+ F,
+ [](const VecDesc &LHS, StringRef S) {
+ return std::strncmp(LHS.ScalarFnName, S.data(),
+ S.size()) < 0;
+ });
+ while (I != VectorDescs.end() && StringRef(I->ScalarFnName) == F) {
+ if (I->VectorizationFactor == VF)
+ return I->VectorFnName;
+ ++I;
+ }
+ return StringRef();
+}
+
+StringRef TargetLibraryInfo::getScalarizedFunction(StringRef F,
+ unsigned &VF) const {
+ F = sanitizeFunctionName(F);
+ if (F.empty())
+ return F;
+
+ std::vector<VecDesc>::const_iterator I =
+ std::lower_bound(ScalarDescs.begin(),
+ ScalarDescs.end(),
+ F,
+ [](const VecDesc &LHS, StringRef S) {
+ return std::strncmp(LHS.VectorFnName, S.data(),
+ S.size()) < 0;
+ });
+ if (I == ScalarDescs.end() || StringRef(I->VectorFnName) != F)
+ return StringRef();
+ VF = I->VectorizationFactor;
+ return I->ScalarFnName;
+}
Index: lib/CodeGen/BasicTargetTransformInfo.cpp
===================================================================
--- lib/CodeGen/BasicTargetTransformInfo.cpp (revision 204039)
+++ lib/CodeGen/BasicTargetTransformInfo.cpp (working copy)
@@ -16,8 +16,10 @@
//===----------------------------------------------------------------------===//
#define DEBUG_TYPE "basictti"
+#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/CodeGen/Passes.h"
-#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Function.h"
+#include "llvm/Target/TargetLibraryInfo.h"
#include "llvm/Target/TargetLowering.h"
#include <utility>
using namespace llvm;
@@ -105,6 +107,8 @@
unsigned AddressSpace) const override;
unsigned getIntrinsicInstrCost(Intrinsic::ID, Type *RetTy,
ArrayRef<Type*> Tys) const override;
+ unsigned getCallInstrCost(Function *F, Type *RetTy,
+ ArrayRef<Type*> Tys) const override;
unsigned getNumberOfParts(Type *Tp) const override;
unsigned getAddressComputationCost( Type *Ty, bool IsComplex) const override;
unsigned getReductionCost(unsigned Opcode, Type *Ty,
@@ -434,7 +438,7 @@
for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
if (Tys[i]->isVectorTy()) {
ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
- ScalarCalls = std::max(ScalarCalls, RetTy->getVectorNumElements());
+ ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
}
}
@@ -493,13 +497,40 @@
unsigned Num = RetTy->getVectorNumElements();
unsigned Cost = TopTTI->getIntrinsicInstrCost(IID, RetTy->getScalarType(),
Tys);
- return 10 * Cost * Num;
+ return Cost * Num;
}
// This is going to be turned into a library call, make it expensive.
return 10;
}
+unsigned BasicTTI::getCallInstrCost(Function *F, Type *RetTy,
+ ArrayRef<Type *> Tys) const {
+
+ // Scalar function calls are always expensive.
+ if (!RetTy->isVectorTy())
+ return 10;
+
+ const TargetLibraryInfo *TLI = getAnalysisIfAvailable<TargetLibraryInfo>();
+
+ // Functions with a vector form are no more expensive than a scalar call.
+ if (TLI && TLI->isFunctionVectorizable(F->getName(),
+ RetTy->getVectorNumElements()))
+ return 10;
+
+ // We have to scalarize this function call. Estimate the cost.
+ unsigned ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
+ unsigned ScalarCalls = RetTy->getVectorNumElements();
+ for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
+ if (Tys[i]->isVectorTy()) {
+ ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
+ ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
+ }
+ }
+
+ return ScalarCalls * 10 + ScalarizationCost;
+}
+
unsigned BasicTTI::getNumberOfParts(Type *Tp) const {
std::pair<unsigned, MVT> LT = getTLI()->getTypeLegalizationCost(Tp);
return LT.first;
Index: lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================
--- lib/Transforms/Vectorize/LoopVectorize.cpp (revision 204039)
+++ lib/Transforms/Vectorize/LoopVectorize.cpp (working copy)
@@ -220,11 +220,12 @@
public:
InnerLoopVectorizer(Loop *OrigLoop, ScalarEvolution *SE, LoopInfo *LI,
DominatorTree *DT, const DataLayout *DL,
- const TargetLibraryInfo *TLI, unsigned VecWidth,
- unsigned UnrollFactor)
- : OrigLoop(OrigLoop), SE(SE), LI(LI), DT(DT), DL(DL), TLI(TLI),
- VF(VecWidth), UF(UnrollFactor), Builder(SE->getContext()), Induction(0),
- OldInduction(0), WidenMap(UnrollFactor), Legal(0) {}
+ const TargetLibraryInfo *TLI,
+ const TargetTransformInfo *TTI,
+ unsigned VecWidth, unsigned UnrollFactor)
+ : OrigLoop(OrigLoop), SE(SE), LI(LI), DT(DT), DL(DL), TLI(TLI), TTI(TTI),
+ VF(VecWidth), UF(UnrollFactor), Builder(SE->getContext()), Induction(0),
+ OldInduction(0), WidenMap(UnrollFactor), Legal(0) {}
// Perform the actual loop widening (vectorization).
void vectorize(LoopVectorizationLegality *L) {
@@ -382,6 +383,8 @@
const DataLayout *DL;
/// Target Library Info.
const TargetLibraryInfo *TLI;
+ /// Target Transform Info.
+ const TargetTransformInfo *TTI;
/// The vectorization SIMD factor to use. Each vector will have this many
/// vector elements.
@@ -429,8 +432,9 @@
public:
InnerLoopUnroller(Loop *OrigLoop, ScalarEvolution *SE, LoopInfo *LI,
DominatorTree *DT, const DataLayout *DL,
- const TargetLibraryInfo *TLI, unsigned UnrollFactor) :
- InnerLoopVectorizer(OrigLoop, SE, LI, DT, DL, TLI, 1, UnrollFactor) { }
+ const TargetLibraryInfo *TLI, const TargetTransformInfo *TTI,
+ unsigned UnrollFactor) :
+ InnerLoopVectorizer(OrigLoop, SE, LI, DT, DL, TLI, TTI, 1, UnrollFactor) { }
private:
void scalarizeInstruction(Instruction *Instr,
@@ -829,11 +833,6 @@
/// width. Vector width of one means scalar.
unsigned getInstructionCost(Instruction *I, unsigned VF);
- /// A helper function for converting Scalar types to vector types.
- /// If the incoming type is void, we return void. If the VF is 1, we return
- /// the scalar type.
- static Type* ToVectorTy(Type *Scalar, unsigned VF);
-
/// Returns whether the instruction is a load or store and will be a emitted
/// as a vector operation.
bool isConsecutiveLoadOrStore(Instruction *I);
@@ -1146,11 +1145,11 @@
return false;
DEBUG(dbgs() << "LV: Trying to at least unroll the loops.\n");
// We decided not to vectorize, but we may want to unroll.
- InnerLoopUnroller Unroller(L, SE, LI, DT, DL, TLI, UF);
+ InnerLoopUnroller Unroller(L, SE, LI, DT, DL, TLI, TTI, UF);
Unroller.vectorize(&LVL);
} else {
// If we decided that it is *legal* to vectorize the loop then do it.
- InnerLoopVectorizer LB(L, SE, LI, DT, DL, TLI, VF.Width, UF);
+ InnerLoopVectorizer LB(L, SE, LI, DT, DL, TLI, TTI, VF.Width, UF);
LB.vectorize(&LVL);
}
@@ -1224,6 +1223,15 @@
return SE->getSCEV(Ptr);
}
+/// A helper function for converting Scalar types to vector types.
+/// If the incoming type is void, we return void. If the VF is 1, we return
+/// the scalar type.
+static Type* ToVectorTy(Type *Scalar, unsigned VF) {
+ if (Scalar->isVoidTy() || VF == 1)
+ return Scalar;
+ return VectorType::get(Scalar, VF);
+}
+
void LoopVectorizationLegality::RuntimePointerCheck::insert(
ScalarEvolution *SE, Loop *Lp, Value *Ptr, bool WritePtr, unsigned DepSetId,
ValueToValueMap &Strides) {
@@ -3115,28 +3123,105 @@
Module *M = BB->getParent()->getParent();
CallInst *CI = cast<CallInst>(it);
+ Function *F = CI->getCalledFunction();
+ StringRef FnName = F->getName();
+ Type *RetTy = ToVectorTy(CI->getType(), VF);
+ SmallVector<Type*, 4> Tys;
+ for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
+ Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
+
Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI);
- assert(ID && "Not an intrinsic call!");
- switch (ID) {
- case Intrinsic::lifetime_end:
- case Intrinsic::lifetime_start:
- scalarizeInstruction(it);
- break;
- default:
+ if (ID && TTI->getIntrinsicInstrCost(ID, RetTy, Tys) <
+ TTI->getCallInstrCost(F, RetTy, Tys)) {
+ switch (ID) {
+ case Intrinsic::lifetime_end:
+ case Intrinsic::lifetime_start:
+ scalarizeInstruction(it);
+ break;
+ default:
+ for (unsigned Part = 0; Part < UF; ++Part) {
+ SmallVector<Value *, 4> Args;
+ for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
+ VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
+ Args.push_back(Arg[Part]);
+ }
+ // Fresh names avoid shadowing the F and Tys declared above.
+ Type *IntrTys[] = {CI->getType()};
+ if (VF > 1)
+ IntrTys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
+
+ Function *IntrF = Intrinsic::getDeclaration(M, ID, IntrTys);
+ Entry[Part] = Builder.CreateCall(IntrF, Args);
+ }
+ break;
+ }
+ } else if (TLI && TLI->isFunctionVectorizable(FnName, VF)) {
+ // This is a function with a vector form.
+ StringRef VFnName = TLI->getVectorizedFunction(FnName, VF);
+ assert(!VFnName.empty());
+
+ Function *VectorF = M->getFunction(VFnName);
+ if (!VectorF) {
+ // Generate a declaration
+ FunctionType *FTy = FunctionType::get(RetTy, Tys, false);
+ VectorF = Function::Create(FTy, Function::ExternalLinkage, VFnName, M);
+ assert(VectorF);
+ }
+
for (unsigned Part = 0; Part < UF; ++Part) {
SmallVector<Value *, 4> Args;
for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
Args.push_back(Arg[Part]);
}
- Type *Tys[] = {CI->getType()};
- if (VF > 1)
- Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
- Function *F = Intrinsic::getDeclaration(M, ID, Tys);
- Entry[Part] = Builder.CreateCall(F, Args);
+ Entry[Part] = Builder.CreateCall(VectorF, Args);
}
- break;
+ } else {
+ // We have a function call that has no vector form - we must scalarize
+ // it.
+ // FIXME: We could check if it has a vector form for smaller values of
+ // VF, then chain them together instead of bailing and being fully
+ // scalar.
+ bool IsVoidTy = CI->getType()->isVoidTy();
+
+ for (unsigned UPart = 0; UPart < UF; ++UPart) {
+ Value *VRet = NULL;
+ // If we have to return something, start with an undefined vector and
+ // fill it in element by element.
+ if (!IsVoidTy)
+ VRet = UndefValue::get(VectorType::get(CI->getType(), VF));
+
+ for (unsigned VPart = 0; VPart < VF; ++VPart) {
+
+ SmallVector<Value *, 4> Args;
+ for (unsigned I = 0, IE = CI->getNumArgOperands(); I != IE; ++I) {
+ Value *Operand = CI->getArgOperand(I);
+
+ Instruction *Inst = dyn_cast<Instruction>(Operand);
+ if (!Inst || Legal->isUniformAfterVectorization(Inst)) {
+ // Uniform variable - just use the original scalar argument.
+ Args.push_back(Operand);
+ } else {
+ // Non-uniform.
+ assert(WidenMap.has(Operand) &&
+ "Non-uniform values must be in WidenMap!");
+ Value *VArg = WidenMap.get(Operand)[UPart];
+ Value *Arg =
+ Builder.CreateExtractElement(VArg,
+ Builder.getInt32(VPart));
+ Args.push_back(Arg);
+ }
+ }
+
+ Value *NewCI = Builder.CreateCall(CI->getCalledFunction(), Args);
+
+ if (!IsVoidTy)
+ VRet = Builder.CreateInsertElement(VRet, NewCI,
+ Builder.getInt32(VPart));
+ }
+ Entry[UPart] = VRet;
+ }
+
}
break;
}
@@ -3475,11 +3560,16 @@
return false;
}// end of PHI handling
- // We still don't handle functions. However, we can ignore dbg intrinsic
- // calls and we do handle certain intrinsic and libm functions.
+ // We handle calls that:
+ // * Are debug info intrinsics.
+ // * Have a mapping to an IR intrinsic.
+ // * Have a vector version available.
+
CallInst *CI = dyn_cast<CallInst>(it);
- if (CI && !getIntrinsicIDForCall(CI, TLI) && !isa<DbgInfoIntrinsic>(CI)) {
- DEBUG(dbgs() << "LV: Found a call site.\n");
+ if (CI && !getIntrinsicIDForCall(CI, TLI) && !isa<DbgInfoIntrinsic>(CI) &&
+ !(CI->getCalledFunction() && TLI &&
+ TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) {
+ DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n");
return false;
}
@@ -4412,6 +4502,12 @@
if (Call && getIntrinsicIDForCall(Call, TLI))
continue;
+ // If the function has an explicit vectorized counterpart, we can safely
+ // assume that it can be vectorized.
+ if (Call && Call->getCalledFunction() && TLI &&
+ TLI->isFunctionVectorizable(Call->getCalledFunction()->getName()))
+ continue;
+
LoadInst *Ld = dyn_cast<LoadInst>(it);
if (!Ld) return false;
if (!Ld->isSimple() && !IsAnnotatedParallel) {
@@ -5616,13 +5712,16 @@
}
case Instruction::Call: {
CallInst *CI = cast<CallInst>(I);
- Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI);
- assert(ID && "Not an intrinsic call!");
+ Function *F = CI->getCalledFunction();
Type *RetTy = ToVectorTy(CI->getType(), VF);
SmallVector<Type*, 4> Tys;
for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
- return TTI.getIntrinsicInstrCost(ID, RetTy, Tys);
+
+ unsigned Cost = TTI.getCallInstrCost(F, RetTy, Tys);
+ if (Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI))
+ return std::min(Cost, TTI.getIntrinsicInstrCost(ID, RetTy, Tys));
+ return Cost;
}
default: {
// We are scalarizing the instruction. Return the cost of the scalar
@@ -5649,12 +5748,6 @@
}// end of switch.
}
-Type* LoopVectorizationCostModel::ToVectorTy(Type *Scalar, unsigned VF) {
- if (Scalar->isVoidTy() || VF == 1)
- return Scalar;
- return VectorType::get(Scalar, VF);
-}
-
char LoopVectorize::ID = 0;
static const char lv_name[] = "Loop Vectorization";
INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)