RFC: Enable vectorization of call instructions in the loop vectorizer
James Molloy
james at jamesmolloy.co.uk
Mon Mar 17 07:38:20 PDT 2014
Hi Arnold,
Sorry for the long delay on this - I've been working on it in my spare
time and haven't had much of that lately! :)
This version of the patch:
* Addresses your three points in your previous email.
* Adds support for the Accelerate library, though only for one function in
it (expf), for testing purposes. There is a FIXME for someone with more
Apple knowledge and ability to test than me to fill in the rest.
* Rebases to ToT and updates TargetLibraryInfo to use C++11 lambdas with
std::lower_bound rather than functors (see the sketch below).
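For reference, this is the shape of the lower_bound change - the comparator
functor replaced by a lambda. The table and helper here are illustrative
only; the real code operates on TargetLibraryInfo::StandardNames:

  #include <algorithm>
  #include <cstring>
  #include <iterator>
  #include "llvm/ADT/StringRef.h"

  // A sorted table of C strings, standing in for StandardNames.
  static const char *Names[] = {"cosf", "expf", "sinf"};

  static const char **lookupName(llvm::StringRef Name) {
    return std::lower_bound(std::begin(Names), std::end(Names), Name,
                            [](const char *LHS, llvm::StringRef RHS) {
      // RHS can't contain '\0', so comparing prefixes is sufficient.
      return std::strncmp(LHS, RHS.data(), RHS.size()) < 0;
    });
  }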
Does it look better?
Cheers,
James
On 17 January 2014 17:22, James Molloy <james at jamesmolloy.co.uk> wrote:
> Awesome, thanks Arnold! Very clear now.
>
>
> On 17 January 2014 16:45, Arnold Schwaighofer <aschwaighofer at apple.com>wrote:
>
>>
>> On Jan 17, 2014, at 2:59 AM, James Molloy <james at jamesmolloy.co.uk>
>> wrote:
>>
>> > Hi Arnold,
>> >
>> > > First, we are going to have the situation where there exists an
>> intrinsic ID for a library function (many math library functions have an
>> intrinsic version: expf -> llvm.exp.f32, for example). As a consequence,
>> "getIntrinsicIDForCall" will return it. In this case we can have both: a
>> vectorized library function version and an intrinsic version that may be
>> slower or faster. In such a case the cost model has to decide which one to
>> pick. This means we have to query the cost model in two places: when we
>> get the instruction cost and when we vectorize the call.
>> >
>> > Sure, I will address this.
>> >
>> > > Second, the way we test this. [snip]
>> >
>> > This is very sensible. The only reason I didn't go down this route to
>> start with was that I didn't know of an available library (like Accelerate)
>> and didn't want to add testing/dummy code in tree. Thanks for pointing me
>> at Accelerate - that'll give me a real library to (semi) implement and test.
>> >
>> > > This brings me to issue three. You are currently using
>> TTI->getCallCost() which is not meant to be used with the vectorizer. We
>> should create a getCallInstrCost() function similar to the
>> "getIntrinsicInstrCost" function we already have.
>> > >
>> > > BasicTTI::getCallInstrCost should query TLI->isFunctionVectorizable()
>> and return a sensible value in this case (one that is lower than a
>> scalarized intrinsic lowered as lib call).
>> >
>> > I don't understand the difference between getIntrinsicCost and
>> getIntrinsicInstrCost. They both take the same arguments (but return
>> different values), and the doxygen docstring does not describe the action
>> in enough detail to discern what the required behaviour is.
>> >
>> > Could you please tell me? (and I'll update the docstrings while I'm at
>> it).
>>
>> Sure, TargetTransformInfo is split into two "cost" metrics:
>>
>> * Generic target information which returns its cost in terms of
>> "TargetCostConstants":
>>
>> /// \name Generic Target Information
>> /// @{
>>
>> /// \brief Underlying constants for 'cost' values in this interface.
>> ///
>> /// Many APIs in this interface return a cost. This enum defines the
>> /// fundamental values that should be used to interpret (and produce)
>> /// those costs. The costs are returned as an unsigned rather than a
>> /// member of this enumeration because it is expected that the cost of
>> /// one IR instruction may have a multiplicative factor to it or
>> /// otherwise won't fit directly into the enum. Moreover, it is common
>> /// to sum or average costs which works better as simple integral
>> /// values. Thus this enum only provides constants.
>> ...
>> /// @}
>>
>> This API is used by the inliner (getUserCost) to estimate the cost (size)
>> of instructions.
>>
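The constants Arnold is referring to look roughly like this (from
TargetTransformInfo.h at the time):

  enum TargetCostConstants {
    TCC_Free = 0,      ///< Expected to fold away completely.
    TCC_Basic = 1,     ///< The cost of a typical 'add' instruction.
    TCC_Expensive = 4  ///< The cost of a division-like operation.
  };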
>> * Throughput estimate for the vectorizer. This API attempts to estimate
>> (very crudely, on an instruction-by-instruction basis) the throughput of
>> instructions (since we automatically infer most values using
>> TargetLoweringInfo, and we have to do this from IR, this is not going to
>> be very accurate ...).
>>
>> /// \name Vector Target Information
>> /// @{
>> ...
>> /// \return The expected cost of arithmetic ops, such as mul, xor, fsub,
>> /// etc.
>> virtual unsigned getArithmeticInstrCost(unsigned Opcode, Type *Ty,
>> ...
>> /// @}
>>
>> At a high level, this API tries to answer the question: what does this
>> instruction cost in scalar form ("expf", f32)? Or, what does this
>> instruction cost in vectorized form ("expf", <4 x float>)?
>>
>> BasicTTI::getIntrinsicInstrCost() assumes a cost of 1 for intrinsics that
>> have a corresponding ISA instruction
>> (TLoweringI->isOperationLegalOrPromote(ISD::FEXP) returns true), a cost
>> of 10 for the ones that don't, and then we also incorporate things like
>> type legalization costs, and overhead if we vectorize.
>>
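In other words, the behaviour Arnold describes is roughly this - an
illustrative sketch only, not the actual BasicTTI code; type legalization
is omitted, and the real overhead values come from the
scalarization-overhead helpers:

  // Illustrative: the shape of BasicTTI-style intrinsic costing.
  unsigned sketchIntrinsicCost(bool HasLegalForm, unsigned VF,
                               unsigned ScalarizationOverhead) {
    if (HasLegalForm)  // e.g. isOperationLegalOrPromote(ISD::FEXP)
      return 1;        // maps directly to an ISA instruction
    if (VF <= 1)
      return 10;       // lowered as a single lib call
    // No legal vector form: one lib call per lane, plus extract/insert
    // overhead for the operands and result.
    return 10 * VF + ScalarizationOverhead;
  }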
>> For the new BasicTTI::getCallInstrCost(Function, RetTy, ArgTys) we would
>> also return 10 for scalar versions of the function (RetTy->isVectorTy()
>> == false).
>> For vector queries (RetTy->isVectorTy() == true), if there is a
>> TLibInfo->isFunctionVectorizable(Function->getName(),
>> RetTy->getVectorNumElements()) match we should also return 10. Otherwise,
>> we estimate the cost of scalarization just like we do in
>> getIntrinsicInstrCost. This will guarantee that the vectorized library
>> function call (Cost = 10) will be chosen over the intrinsic lowered to a
>> sequence of scalarized lib calls (Cost = 10 * VF * ...).
>>
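Worked through for expf at VF = 4: the vexpf library call costs 10, while
the scalarized intrinsic costs at least 10 * 4 = 40 before extract/insert
overhead is counted, so the std::min below picks the library call.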
>> Then, in LoopVectorizationCostModel::getInstructionCost() you would query
>> both APIs (if getIntrinsicIDForCall returns an ID) and return the smaller:
>>
>> case Call:
>>   CallInst *CI = cast<CallInst>(I);
>>
>>   Type *RetTy = ToVectorTy(CI->getType(), VF);
>>   SmallVector<Type*, 4> Tys;
>>   for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
>>     Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
>>
>>   unsigned LibFuncCallCost =
>>     TTI.getCallInstrCost(CI->getCalledFunction(), RetTy, Tys);
>>
>>   if (unsigned ID = getIntrinsicIDForCall(CI, TLI))
>>     return std::min(LibFuncCallCost,
>>                     TTI.getIntrinsicInstrCost(ID, RetTy, Tys));
>>   return LibFuncCallCost;
>>
>>
>> Thanks,
>> Arnold
>>
>>
>
>
>
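For orientation before the patch itself, a sketch of how the new
TargetLibraryInfo hooks are intended to be used. The names come from the
patch below; the surrounding code is illustrative only:

  // Mappings registered for the Accelerate environment below:
  //   {"expf", "vexpf", 4} and {"llvm.exp.f32", "vexpf", 4}.
  if (TLI->isFunctionVectorizable("expf", /*VF=*/4)) {
    StringRef VFnName = TLI->getVectorizedFunction("expf", 4); // "vexpf"
    // ... declare and call the vector function, as the LoopVectorize
    // changes do.
  }

  // And the reverse query, from vector name back to scalar name plus VF:
  unsigned VF;
  StringRef SFnName = TLI->getScalarizedFunction("vexpf", VF);
  // SFnName == "expf", VF == 4.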
-------------- next part --------------
Index: include/llvm/Analysis/TargetTransformInfo.h
===================================================================
--- include/llvm/Analysis/TargetTransformInfo.h (revision 204039)
+++ include/llvm/Analysis/TargetTransformInfo.h (working copy)
@@ -389,6 +389,10 @@
virtual unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
ArrayRef<Type *> Tys) const;
+ /// \returns The cost of Call instructions.
+ virtual unsigned getCallInstrCost(Function *F, Type *RetTy,
+ ArrayRef<Type *> Tys) const;
+
/// \returns The number of pieces into which the provided type must be
/// split during legalization. Zero is returned when the answer is unknown.
virtual unsigned getNumberOfParts(Type *Tp) const;
Index: include/llvm/Target/TargetLibraryInfo.h
===================================================================
--- include/llvm/Target/TargetLibraryInfo.h (revision 204039)
+++ include/llvm/Target/TargetLibraryInfo.h (working copy)
@@ -11,6 +11,7 @@
#define LLVM_TARGET_TARGETLIBRARYINFO_H
#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/ArrayRef.h"
#include "llvm/Pass.h"
namespace llvm {
@@ -673,10 +674,26 @@
/// library functions are available for the current target, and allows a
/// frontend to disable optimizations through -fno-builtin etc.
class TargetLibraryInfo : public ImmutablePass {
+public:
+ /// VecDesc - Describes a possible vectorization of a function.
+ /// Function 'VectorFnName' is equivalent to 'ScalarFnName' vectorized
+ /// by a factor 'VectorizationFactor'.
+ struct VecDesc {
+ const char *ScalarFnName;
+ const char *VectorFnName;
+ unsigned VectorizationFactor;
+ };
+
+private:
virtual void anchor();
unsigned char AvailableArray[(LibFunc::NumLibFuncs+3)/4];
llvm::DenseMap<unsigned, std::string> CustomNames;
static const char* StandardNames[LibFunc::NumLibFuncs];
+ /// Vectorization descriptors - sorted by ScalarFnName.
+ std::vector<VecDesc> VectorDescs;
+ /// Scalarization descriptors - same content as VectorDescs but sorted based
+ /// on VectorFnName rather than ScalarFnName.
+ std::vector<VecDesc> ScalarDescs;
enum AvailabilityState {
StandardName = 3, // (memset to all ones)
@@ -772,6 +789,38 @@
/// disableAllFunctions - This disables all builtins, which is used for
/// options like -fno-builtin.
void disableAllFunctions();
+
+ /// addVectorizableFunctions - Add a set of scalar -> vector mappings,
+ /// queryable via getVectorizedFunction and getScalarizedFunction.
+ void addVectorizableFunctions(ArrayRef<VecDesc> Fns);
+
+ /// isFunctionVectorizable - Return true if the function F has a
+ /// vector equivalent with vectorization factor VF.
+ bool isFunctionVectorizable(StringRef F, unsigned VF) const {
+ return !getVectorizedFunction(F, VF).empty();
+ }
+
+ /// isFunctionVectorizable - Return true if the function F has a
+ /// vector equivalent with any vectorization factor.
+ bool isFunctionVectorizable(StringRef F) const;
+
+ /// getVectorizedFunction - Return the name of the equivalent of
+ /// F, vectorized with factor VF. If no such mapping exists,
+ /// return the empty string.
+ StringRef getVectorizedFunction(StringRef F, unsigned VF) const;
+
+ /// isFunctionScalarizable - Return true if the function F has a
+ /// scalar equivalent, and set VF to be the vectorization factor.
+ bool isFunctionScalarizable(StringRef F, unsigned &VF) const {
+ return !getScalarizedFunction(F, VF).empty();
+ }
+
+ /// getScalarizedFunction - Return the name of the equivalent of
+ /// F, scalarized. If no such mapping exists, return the empty string.
+ ///
+ /// Set VF to the vectorization factor.
+ StringRef getScalarizedFunction(StringRef F, unsigned &VF) const;
+
};
} // end namespace llvm
Index: test/Transforms/LoopVectorize/funcall.ll
===================================================================
--- test/Transforms/LoopVectorize/funcall.ll (revision 204039)
+++ test/Transforms/LoopVectorize/funcall.ll (working copy)
@@ -1,6 +1,7 @@
; RUN: opt -S -loop-vectorize -force-vector-width=2 -force-vector-unroll=1 < %s | FileCheck %s
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
+target triple = "armv7-apple-macos-Accelerate"
; Make sure we can vectorize loops with functions to math library functions.
; They might read the rounding mode but we are only vectorizing loops that
Index: test/Transforms/LoopVectorize/libcall.ll
===================================================================
--- test/Transforms/LoopVectorize/libcall.ll (revision 0)
+++ test/Transforms/LoopVectorize/libcall.ll (working copy)
@@ -0,0 +1,55 @@
+; RUN: opt < %s -loop-vectorize -S | FileCheck %s
+
+target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
+target triple = "armv7-apple-macos-Accelerate"
+
+;CHECK-LABEL: @exp_intrinsic_f32(
+;CHECK: vexp
+;CHECK: ret void
+define void @exp_intrinsic_f32(i32 %n, float* noalias %y, float* noalias %x) nounwind uwtable {
+entry:
+ %cmp6 = icmp sgt i32 %n, 0
+ br i1 %cmp6, label %for.body, label %for.end
+
+for.body: ; preds = %entry, %for.body
+ %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
+ %arrayidx = getelementptr inbounds float* %y, i64 %indvars.iv
+ %0 = load float* %arrayidx, align 4
+ %call = tail call float @llvm.exp.f32(float %0) nounwind readnone
+ %arrayidx2 = getelementptr inbounds float* %x, i64 %indvars.iv
+ store float %call, float* %arrayidx2, align 4
+ %indvars.iv.next = add i64 %indvars.iv, 1
+ %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+ %exitcond = icmp eq i32 %lftr.wideiv, %n
+ br i1 %exitcond, label %for.end, label %for.body
+
+for.end: ; preds = %for.body, %entry
+ ret void
+}
+
+;CHECK-LABEL: @exp_libcall_f32(
+;CHECK: vexp
+;CHECK: ret void
+define void @exp_libcall_f32(i32 %n, float* noalias %y, float* noalias %x) nounwind uwtable {
+entry:
+ %cmp6 = icmp sgt i32 %n, 0
+ br i1 %cmp6, label %for.body, label %for.end
+
+for.body: ; preds = %entry, %for.body
+ %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
+ %arrayidx = getelementptr inbounds float* %y, i64 %indvars.iv
+ %0 = load float* %arrayidx, align 4
+ %call = tail call float @expf(float %0) nounwind readnone
+ %arrayidx2 = getelementptr inbounds float* %x, i64 %indvars.iv
+ store float %call, float* %arrayidx2, align 4
+ %indvars.iv.next = add i64 %indvars.iv, 1
+ %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+ %exitcond = icmp eq i32 %lftr.wideiv, %n
+ br i1 %exitcond, label %for.end, label %for.body
+
+for.end: ; preds = %for.body, %entry
+ ret void
+}
+
+declare float @llvm.exp.f32(float) nounwind readnone
+declare float @expf(float) nounwind readnone
\ No newline at end of file
Index: test/Analysis/CostModel/X86/intrinsic-cost.ll
===================================================================
--- test/Analysis/CostModel/X86/intrinsic-cost.ll (revision 204039)
+++ test/Analysis/CostModel/X86/intrinsic-cost.ll (working copy)
@@ -22,7 +22,7 @@
ret void
; CORE2: Printing analysis 'Cost Model Analysis' for function 'test1':
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction: %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 40 for instruction: %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
; COREI7: Printing analysis 'Cost Model Analysis' for function 'test1':
; COREI7: Cost Model: Found an estimated cost of 1 for instruction: %2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %wide.load)
@@ -50,7 +50,7 @@
ret void
; CORE2: Printing analysis 'Cost Model Analysis' for function 'test2':
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction: %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 40 for instruction: %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
; COREI7: Printing analysis 'Cost Model Analysis' for function 'test2':
; COREI7: Cost Model: Found an estimated cost of 1 for instruction: %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
Index: lib/Analysis/TargetTransformInfo.cpp
===================================================================
--- lib/Analysis/TargetTransformInfo.cpp (revision 204039)
+++ lib/Analysis/TargetTransformInfo.cpp (working copy)
@@ -215,6 +215,13 @@
return PrevTTI->getIntrinsicInstrCost(ID, RetTy, Tys);
}
+unsigned
+TargetTransformInfo::getCallInstrCost(Function *F,
+ Type *RetTy,
+ ArrayRef<Type *> Tys) const {
+ return PrevTTI->getCallInstrCost(F, RetTy, Tys);
+}
+
unsigned TargetTransformInfo::getNumberOfParts(Type *Tp) const {
return PrevTTI->getNumberOfParts(Tp);
}
@@ -600,6 +607,12 @@
return 1;
}
+ unsigned getCallInstrCost(Function *F,
+ Type *RetTy,
+ ArrayRef<Type*> Tys) const override {
+ return 10;
+ }
+
unsigned getNumberOfParts(Type *Tp) const override {
return 0;
}
Index: lib/Target/TargetLibraryInfo.cpp
===================================================================
--- lib/Target/TargetLibraryInfo.cpp (revision 204039)
+++ lib/Target/TargetLibraryInfo.cpp (working copy)
@@ -663,6 +663,17 @@
TLI.setUnavailable(LibFunc::statvfs64);
TLI.setUnavailable(LibFunc::tmpfile64);
}
+
+ // The Accelerate library adds vectorizable variants of many
+ // standard library functions.
+ // FIXME: Make the following list complete.
+ if (T.getEnvironmentName() == "Accelerate") {
+ const TargetLibraryInfo::VecDesc VecFuncs[] = {
+ {"expf", "vexpf", 4},
+ {"llvm.exp.f32", "vexpf", 4}
+ };
+ TLI.addVectorizableFunctions(VecFuncs);
+ }
}
@@ -686,23 +697,17 @@
CustomNames = TLI.CustomNames;
}
-namespace {
-struct StringComparator {
- /// Compare two strings and return true if LHS is lexicographically less than
- /// RHS. Requires that RHS doesn't contain any zero bytes.
- bool operator()(const char *LHS, StringRef RHS) const {
- // Compare prefixes with strncmp. If prefixes match we know that LHS is
- // greater or equal to RHS as RHS can't contain any '\0'.
- return std::strncmp(LHS, RHS.data(), RHS.size()) < 0;
- }
+static StringRef sanitizeFunctionName(StringRef funcName) {
+ // Filter out empty names and names containing null bytes, those can't be in
+ // our table.
+ if (funcName.empty() || funcName.find('\0') != StringRef::npos)
+ return StringRef();
- // Provided for compatibility with MSVC's debug mode.
- bool operator()(StringRef LHS, const char *RHS) const { return LHS < RHS; }
- bool operator()(StringRef LHS, StringRef RHS) const { return LHS < RHS; }
- bool operator()(const char *LHS, const char *RHS) const {
- return std::strcmp(LHS, RHS) < 0;
- }
-};
+ // Check for \01 prefix that is used to mangle __asm declarations and
+ // strip it if present.
+ if (funcName.front() == '\01')
+ funcName = funcName.substr(1);
+ return funcName;
}
bool TargetLibraryInfo::getLibFunc(StringRef funcName,
@@ -710,16 +715,13 @@
const char **Start = &StandardNames[0];
const char **End = &StandardNames[LibFunc::NumLibFuncs];
- // Filter out empty names and names containing null bytes, those can't be in
- // our table.
- if (funcName.empty() || funcName.find('\0') != StringRef::npos)
+ funcName = sanitizeFunctionName(funcName);
+ if (funcName.empty())
return false;
- // Check for \01 prefix that is used to mangle __asm declarations and
- // strip it if present.
- if (funcName.front() == '\01')
- funcName = funcName.substr(1);
- const char **I = std::lower_bound(Start, End, funcName, StringComparator());
+ // Compare prefixes with strncmp; funcName can't contain a '\0', so if
+ // the prefixes match, LHS is lexicographically >= funcName.
+ const char **I = std::lower_bound(Start, End, funcName,
+ [](const char *LHS, StringRef RHS) {
+ return std::strncmp(LHS, RHS.data(), RHS.size()) < 0;
+ });
if (I != End && *I == funcName) {
F = (LibFunc::Func)(I - Start);
return true;
@@ -732,3 +734,77 @@
void TargetLibraryInfo::disableAllFunctions() {
memset(AvailableArray, 0, sizeof(AvailableArray));
}
+
+void TargetLibraryInfo::addVectorizableFunctions(ArrayRef<VecDesc> Fns) {
+ VectorDescs.insert(VectorDescs.end(), Fns.begin(), Fns.end());
+ std::sort(VectorDescs.begin(), VectorDescs.end(),
+ [](const VecDesc &LHS, const VecDesc &RHS) {
+ // Use strcmp so prefixes like "exp" vs "expf" order consistently.
+ return std::strcmp(LHS.ScalarFnName, RHS.ScalarFnName) < 0;
+ });
+
+ ScalarDescs.insert(ScalarDescs.end(), Fns.begin(), Fns.end());
+ std::sort(ScalarDescs.begin(), ScalarDescs.end(),
+ [](const VecDesc &LHS, const VecDesc &RHS) {
+ return std::strcmp(LHS.VectorFnName, RHS.VectorFnName) < 0;
+ });
+}
+
+bool TargetLibraryInfo::isFunctionVectorizable(StringRef funcName) const {
+ funcName = sanitizeFunctionName(funcName);
+ if (funcName.empty())
+ return false;
+
+ std::vector<VecDesc>::const_iterator I =
+ std::lower_bound(VectorDescs.begin(),
+ VectorDescs.end(),
+ funcName,
+ [](const VecDesc &LHS, StringRef S) {
+ return std::strncmp(LHS.ScalarFnName, S.data(),
+ S.size()) < 0;
+ });
+ return I != VectorDescs.end() && StringRef(I->ScalarFnName) == funcName;
+}
+
+StringRef TargetLibraryInfo::getVectorizedFunction(StringRef F,
+ unsigned VF) const {
+ F = sanitizeFunctionName(F);
+ if (F.empty())
+ return F;
+
+ std::vector<VecDesc>::const_iterator I =
+ std::lower_bound(VectorDescs.begin(),
+ VectorDescs.end(),
+ F,
+ [](const VecDesc &LHS, StringRef S) {
+ return std::strncmp(LHS.ScalarFnName, S.data(),
+ S.size()) < 0;
+ });
+ while (I != VectorDescs.end() && StringRef(I->ScalarFnName) == F) {
+ if (I->VectorizationFactor == VF)
+ return I->VectorFnName;
+ ++I;
+ }
+ return StringRef();
+}
+
+StringRef TargetLibraryInfo::getScalarizedFunction(StringRef F,
+ unsigned &VF) const {
+ F = sanitizeFunctionName(F);
+ if (F.empty())
+ return F;
+
+ std::vector<VecDesc>::const_iterator I =
+ std::lower_bound(ScalarDescs.begin(),
+ ScalarDescs.end(),
+ F,
+ [](const VecDesc &LHS, StringRef S) {
+ return std::strncmp(LHS.VectorFnName, S.data(),
+ S.size()) < 0;
+ });
+ if (I == ScalarDescs.end() || StringRef(I->VectorFnName) != F)
+ return StringRef();
+ VF = I->VectorizationFactor;
+ return I->ScalarFnName;
+}
Index: lib/CodeGen/BasicTargetTransformInfo.cpp
===================================================================
--- lib/CodeGen/BasicTargetTransformInfo.cpp (revision 204039)
+++ lib/CodeGen/BasicTargetTransformInfo.cpp (working copy)
@@ -16,8 +16,10 @@
//===----------------------------------------------------------------------===//
#define DEBUG_TYPE "basictti"
+#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/CodeGen/Passes.h"
-#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Function.h"
+#include "llvm/Target/TargetLibraryInfo.h"
#include "llvm/Target/TargetLowering.h"
#include <utility>
using namespace llvm;
@@ -105,6 +107,8 @@
unsigned AddressSpace) const override;
unsigned getIntrinsicInstrCost(Intrinsic::ID, Type *RetTy,
ArrayRef<Type*> Tys) const override;
+ unsigned getCallInstrCost(Function *F, Type *RetTy,
+ ArrayRef<Type*> Tys) const override;
unsigned getNumberOfParts(Type *Tp) const override;
unsigned getAddressComputationCost( Type *Ty, bool IsComplex) const override;
unsigned getReductionCost(unsigned Opcode, Type *Ty,
@@ -434,7 +438,7 @@
for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
if (Tys[i]->isVectorTy()) {
ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
- ScalarCalls = std::max(ScalarCalls, RetTy->getVectorNumElements());
+ ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
}
}
@@ -493,13 +497,40 @@
unsigned Num = RetTy->getVectorNumElements();
unsigned Cost = TopTTI->getIntrinsicInstrCost(IID, RetTy->getScalarType(),
Tys);
- return 10 * Cost * Num;
+ return Cost * Num;
}
// This is going to be turned into a library call, make it expensive.
return 10;
}
+unsigned BasicTTI::getCallInstrCost(Function *F, Type *RetTy,
+ ArrayRef<Type *> Tys) const {
+
+ // Scalar function calls are always expensive.
+ if (!RetTy->isVectorTy())
+ return 10;
+
+ const TargetLibraryInfo *TLI = getAnalysisIfAvailable<TargetLibraryInfo>();
+
+ // Functions with a vector form are no more expensive than a scalar call.
+ if (TLI && TLI->isFunctionVectorizable(F->getName(),
+ RetTy->getVectorNumElements()))
+ return 10;
+
+ // We have to scalarize this function call. Estimate the cost.
+ unsigned ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
+ unsigned ScalarCalls = RetTy->getVectorNumElements();
+ for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
+ if (Tys[i]->isVectorTy()) {
+ ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
+ ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
+ }
+ }
+
+ return ScalarCalls * 10 + ScalarizationCost;
+}
+
unsigned BasicTTI::getNumberOfParts(Type *Tp) const {
std::pair<unsigned, MVT> LT = getTLI()->getTypeLegalizationCost(Tp);
return LT.first;
Index: lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================
--- lib/Transforms/Vectorize/LoopVectorize.cpp (revision 204039)
+++ lib/Transforms/Vectorize/LoopVectorize.cpp (working copy)
@@ -220,11 +220,12 @@
public:
InnerLoopVectorizer(Loop *OrigLoop, ScalarEvolution *SE, LoopInfo *LI,
DominatorTree *DT, const DataLayout *DL,
- const TargetLibraryInfo *TLI, unsigned VecWidth,
- unsigned UnrollFactor)
- : OrigLoop(OrigLoop), SE(SE), LI(LI), DT(DT), DL(DL), TLI(TLI),
- VF(VecWidth), UF(UnrollFactor), Builder(SE->getContext()), Induction(0),
- OldInduction(0), WidenMap(UnrollFactor), Legal(0) {}
+ const TargetLibraryInfo *TLI,
+ const TargetTransformInfo *TTI,
+ unsigned VecWidth, unsigned UnrollFactor)
+ : OrigLoop(OrigLoop), SE(SE), LI(LI), DT(DT), DL(DL), TLI(TLI), TTI(TTI),
+ VF(VecWidth), UF(UnrollFactor), Builder(SE->getContext()), Induction(0),
+ OldInduction(0), WidenMap(UnrollFactor), Legal(0) {}
// Perform the actual loop widening (vectorization).
void vectorize(LoopVectorizationLegality *L) {
@@ -382,6 +383,8 @@
const DataLayout *DL;
/// Target Library Info.
const TargetLibraryInfo *TLI;
+ /// Target Transform Info.
+ const TargetTransformInfo *TTI;
/// The vectorization SIMD factor to use. Each vector will have this many
/// vector elements.
@@ -429,8 +432,9 @@
public:
InnerLoopUnroller(Loop *OrigLoop, ScalarEvolution *SE, LoopInfo *LI,
DominatorTree *DT, const DataLayout *DL,
- const TargetLibraryInfo *TLI, unsigned UnrollFactor) :
- InnerLoopVectorizer(OrigLoop, SE, LI, DT, DL, TLI, 1, UnrollFactor) { }
+ const TargetLibraryInfo *TLI, const TargetTransformInfo *TTI,
+ unsigned UnrollFactor) :
+ InnerLoopVectorizer(OrigLoop, SE, LI, DT, DL, TLI, TTI, 1, UnrollFactor) { }
private:
void scalarizeInstruction(Instruction *Instr,
@@ -829,11 +833,6 @@
/// width. Vector width of one means scalar.
unsigned getInstructionCost(Instruction *I, unsigned VF);
- /// A helper function for converting Scalar types to vector types.
- /// If the incoming type is void, we return void. If the VF is 1, we return
- /// the scalar type.
- static Type* ToVectorTy(Type *Scalar, unsigned VF);
-
/// Returns whether the instruction is a load or store and will be a emitted
/// as a vector operation.
bool isConsecutiveLoadOrStore(Instruction *I);
@@ -1146,11 +1145,11 @@
return false;
DEBUG(dbgs() << "LV: Trying to at least unroll the loops.\n");
// We decided not to vectorize, but we may want to unroll.
- InnerLoopUnroller Unroller(L, SE, LI, DT, DL, TLI, UF);
+ InnerLoopUnroller Unroller(L, SE, LI, DT, DL, TLI, TTI, UF);
Unroller.vectorize(&LVL);
} else {
// If we decided that it is *legal* to vectorize the loop then do it.
- InnerLoopVectorizer LB(L, SE, LI, DT, DL, TLI, VF.Width, UF);
+ InnerLoopVectorizer LB(L, SE, LI, DT, DL, TLI, TTI, VF.Width, UF);
LB.vectorize(&LVL);
}
@@ -1224,6 +1223,15 @@
return SE->getSCEV(Ptr);
}
+/// A helper function for converting Scalar types to vector types.
+/// If the incoming type is void, we return void. If the VF is 1, we return
+/// the scalar type.
+static Type* ToVectorTy(Type *Scalar, unsigned VF) {
+ if (Scalar->isVoidTy() || VF == 1)
+ return Scalar;
+ return VectorType::get(Scalar, VF);
+}
+
void LoopVectorizationLegality::RuntimePointerCheck::insert(
ScalarEvolution *SE, Loop *Lp, Value *Ptr, bool WritePtr, unsigned DepSetId,
ValueToValueMap &Strides) {
@@ -3115,28 +3123,105 @@
Module *M = BB->getParent()->getParent();
CallInst *CI = cast<CallInst>(it);
+ Function *F = CI->getCalledFunction();
+ StringRef FnName = F->getName();
+ Type *RetTy = ToVectorTy(CI->getType(), VF);
+ SmallVector<Type*, 4> Tys;
+ for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
+ Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
+
Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI);
- assert(ID && "Not an intrinsic call!");
- switch (ID) {
- case Intrinsic::lifetime_end:
- case Intrinsic::lifetime_start:
- scalarizeInstruction(it);
- break;
- default:
+ if (ID && TTI->getIntrinsicInstrCost(ID, RetTy, Tys) <
+ TTI->getCallInstrCost(F, RetTy, Tys)) {
+ switch (ID) {
+ case Intrinsic::lifetime_end:
+ case Intrinsic::lifetime_start:
+ scalarizeInstruction(it);
+ break;
+ default:
+ for (unsigned Part = 0; Part < UF; ++Part) {
+ SmallVector<Value *, 4> Args;
+ for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
+ VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
+ Args.push_back(Arg[Part]);
+ }
+ // Fresh names avoid shadowing the F and Tys declared above.
+ Type *IntrTys[] = {CI->getType()};
+ if (VF > 1)
+ IntrTys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
+
+ Function *IntrF = Intrinsic::getDeclaration(M, ID, IntrTys);
+ Entry[Part] = Builder.CreateCall(IntrF, Args);
+ }
+ break;
+ }
+ } else if (TLI && TLI->isFunctionVectorizable(FnName, VF)) {
+ // This is a function with a vector form.
+ StringRef VFnName = TLI->getVectorizedFunction(FnName, VF);
+ assert(!VFnName.empty());
+
+ Function *VectorF = M->getFunction(VFnName);
+ if (!VectorF) {
+ // Generate a declaration
+ FunctionType *FTy = FunctionType::get(RetTy, Tys, false);
+ VectorF = Function::Create(FTy, Function::ExternalLinkage, VFnName, M);
+ assert(VectorF);
+ }
+
for (unsigned Part = 0; Part < UF; ++Part) {
SmallVector<Value *, 4> Args;
for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
Args.push_back(Arg[Part]);
}
- Type *Tys[] = {CI->getType()};
- if (VF > 1)
- Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
- Function *F = Intrinsic::getDeclaration(M, ID, Tys);
- Entry[Part] = Builder.CreateCall(F, Args);
+ Entry[Part] = Builder.CreateCall(VectorF, Args);
}
- break;
+ } else {
+ // We have a function call that has no vector form - we must scalarize
+ // it.
+ // FIXME: We could check if it has a vector form for smaller values of
+ // VF, then chain them together instead of bailing and being fully
+ // scalar.
+ bool IsVoidTy = CI->getType()->isVoidTy();
+
+ for (unsigned UPart = 0; UPart < UF; ++UPart) {
+ Value *VRet = NULL;
+ // If we have to return something, start with an undefined vector and
+ // fill it in element by element.
+ if (!IsVoidTy)
+ VRet = UndefValue::get(VectorType::get(CI->getType(), VF));
+
+ for (unsigned VPart = 0; VPart < VF; ++VPart) {
+
+ SmallVector<Value *, 4> Args;
+ for (unsigned I = 0, IE = CI->getNumArgOperands(); I != IE; ++I) {
+ Value *Operand = CI->getArgOperand(I);
+
+ Instruction *Inst = dyn_cast<Instruction>(Operand);
+ if (!Inst || Legal->isUniformAfterVectorization(Inst)) {
+ // Uniform variable - just use the original scalar argument.
+ Args.push_back(Operand);
+ } else {
+ // Non-uniform.
+ assert(WidenMap.has(Operand) &&
+ "Non-uniform values must be in WidenMap!");
+ Value *VArg = WidenMap.get(Operand)[UPart];
+ Value *Arg =
+ Builder.CreateExtractElement(VArg,
+ Builder.getInt32(VPart));
+ Args.push_back(Arg);
+ }
+ }
+
+ Value *NewCI = Builder.CreateCall(CI->getCalledFunction(), Args);
+
+ if (!IsVoidTy)
+ VRet = Builder.CreateInsertElement(VRet, NewCI,
+ Builder.getInt32(VPart));
+ }
+ Entry[UPart] = VRet;
+ }
+
}
break;
}
@@ -3475,11 +3560,16 @@
return false;
}// end of PHI handling
- // We still don't handle functions. However, we can ignore dbg intrinsic
- // calls and we do handle certain intrinsic and libm functions.
+ // We handle calls that:
+ // * Are debug info intrinsics.
+ // * Have a mapping to an IR intrinsic.
+ // * Have a vector version available.
+
CallInst *CI = dyn_cast<CallInst>(it);
- if (CI && !getIntrinsicIDForCall(CI, TLI) && !isa<DbgInfoIntrinsic>(CI)) {
- DEBUG(dbgs() << "LV: Found a call site.\n");
+ if (CI && !getIntrinsicIDForCall(CI, TLI) && !isa<DbgInfoIntrinsic>(CI) &&
+ !(CI->getCalledFunction() && TLI &&
+ TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) {
+ DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n");
return false;
}
@@ -4412,6 +4502,12 @@
if (Call && getIntrinsicIDForCall(Call, TLI))
continue;
+ // If the function has an explicit vectorized counterpart, we can safely
+ // assume that it can be vectorized.
+ if (Call && Call->getCalledFunction() && TLI &&
+ TLI->isFunctionVectorizable(Call->getCalledFunction()->getName()))
+ continue;
+
LoadInst *Ld = dyn_cast<LoadInst>(it);
if (!Ld) return false;
if (!Ld->isSimple() && !IsAnnotatedParallel) {
@@ -5616,13 +5712,16 @@
}
case Instruction::Call: {
CallInst *CI = cast<CallInst>(I);
- Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI);
- assert(ID && "Not an intrinsic call!");
+ Function *F = CI->getCalledFunction();
Type *RetTy = ToVectorTy(CI->getType(), VF);
SmallVector<Type*, 4> Tys;
for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)
Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));
- return TTI.getIntrinsicInstrCost(ID, RetTy, Tys);
+
+ unsigned Cost = TTI.getCallInstrCost(F, RetTy, Tys);
+ if (Intrinsic::ID ID = getIntrinsicIDForCall(CI, TLI))
+ return std::min(Cost, TTI.getIntrinsicInstrCost(ID, RetTy, Tys));
+ return Cost;
}
default: {
// We are scalarizing the instruction. Return the cost of the scalar
@@ -5649,12 +5748,6 @@
}// end of switch.
}
-Type* LoopVectorizationCostModel::ToVectorTy(Type *Scalar, unsigned VF) {
- if (Scalar->isVoidTy() || VF == 1)
- return Scalar;
- return VectorType::get(Scalar, VF);
-}
-
char LoopVectorize::ID = 0;
static const char lv_name[] = "Loop Vectorization";
INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)