RFC: Enable vectorization of call instructions in the loop vectorizer

Hal Finkel hfinkel at anl.gov
Thu Mar 5 18:05:40 PST 2015


----- Original Message -----
> From: "Michael Zolotukhin" <mzolotukhin at apple.com>
> To: "Hal J. Finkel" <hfinkel at anl.gov>
> Cc: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "James Molloy" <james at jamesmolloy.co.uk>,
> "Nadav Rotem" <nrotem at apple.com>, "renato golin" <renato.golin at linaro.org>, "Arnold Schwaighofer"
> <aschwaighofer at apple.com>
> Sent: Wednesday, March 4, 2015 9:00:52 PM
> Subject: Re: RFC: Enable vectorization of call instructions in the loop vectorizer
> 
> 
> 
> Hi Hal,
> 
> Thanks for the feedback! I’m attaching updated patches.
> I changed three of them: 0002, 0003 and 0005.
> 
> In 0002 I fixed a bug which caused the compiler to think that it can
> vectorize any function (we did a binary search with lower_bound, but
> didn’t check that the element is actually in the table).
> 
> In 0003 and 0005 I factored out the cost of the scalar case, as you
> suggested.

Okay, thanks. I think there are other things I'd like to comment on, but it is difficult to see the context in the patches. Could you please upload them to Phabricator?

 -Hal

> 
> 
> As for ‘Accelerate’ in the target tuple - I believe that was the final
> decision from the last discussion, so I chose that way. I don’t mind
> switching to something else, though.
> 
> Thanks,
> Michael
> > On Mar 4, 2015, at 10:22 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > 
> > Hi Michael,
> > 
> > Thanks for working on this. Regarding the TTI implementation, you
> > have:
> > 
> > +  /// \returns The cost of Call instruction.
> > +  unsigned getCallInstrCost(Function *F, Type *RetTy, ArrayRef<Type *> Tys,
> > +                            const TargetLibraryInfo *TLI) {
> > +
> > +    // Scalar function calls are always expensive.
> > +    if (!RetTy->isVectorTy())
> > +      return 10;
> > +
> > +    // Functions with a vector form are no more expensive than a scalar call.
> > +    if (TLI &&
> > +        TLI->isFunctionVectorizable(F->getName(),
> > +                                    RetTy->getVectorNumElements()))
> > +      return 10;
> > 
> > ...
> > 
> > I'm a bit concerned about this '10'. The target needs to be able to
> > override this without replacing this whole function. Maybe the
> > function, when handling vector calls, should call itself with the
> > scalar type to get the base cost? Then the target can override to
> > provide the base scalar cost.
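> > 
> > Roughly what I have in mind (just a sketch, untested):
> > 
> >   // Scalar calls: this is the base cost, which a target can override.
> >   if (!RetTy->isVectorTy())
> >     return 10;
> > 
> >   // Recurse with the scalar type, so a target override of the scalar
> >   // call cost feeds into the vector estimates below.
> >   unsigned ScalarCost = getCallInstrCost(F, RetTy->getScalarType(), Tys, TLI);
> > 
> >   // A call with an available vector form costs no more than the scalar call.
> >   if (TLI && TLI->isFunctionVectorizable(F->getName(),
> >                                          RetTy->getVectorNumElements()))
> >     return ScalarCost;
> > 
> >   // ... and the scalarization estimate would then become:
> >   return ScalarCalls * ScalarCost + ScalarizationCost;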
> > 
> > I see that this factor of 10 is also in the cost-model printing
> > code, and we need to change it there too (I think that this value
> > did not matter much previously, but it looks like it will matter
> > much more with this change).
> > 
> > +  // The Accelerate library adds vectorizable variants of many
> > +  // standard library functions.
> > +  // FIXME: Make the following list complete.
> > +  if (T.getEnvironmentName() == "Accelerate") {
> > +    const VecDesc VecFuncs[] = {
> > 
> > Accelerate? In the triple?
> > 
> > -Hal
> > 
> > ----- Original Message -----
> >> From: "Michael Zolotukhin" <mzolotukhin at apple.com>
> >> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>
> >> Cc: "Commit Messages and Patches for LLVM"
> >> <llvm-commits at cs.uiuc.edu>, "James Molloy"
> >> <james at jamesmolloy.co.uk>,
> >> "Nadav Rotem" <nrotem at apple.com>, "Hal J. Finkel"
> >> <hfinkel at anl.gov>, "renato golin" <renato.golin at linaro.org>
> >> Sent: Monday, March 2, 2015 10:25:15 PM
> >> Subject: Re: RFC: Enable vectorization of call instructions in the
> >> loop vectorizer
> >> 
> >> 
> >> 
> >> Hi Arnold,
> >> 
> >> Thank you for the review! I’ve committed patches 0001, 0002, 0003 and
> >> 0005, which don’t add any new functionality.
> >> 
> >> I’ll proceed with the remaining patches later this week if no one else
> >> has objections and my testing goes fine. For reference, these are the
> >> patches (updated according to your remark, and renumbered because some
> >> patches have already been committed):
> >> 
> >> 
> >> 
> >> 
> >> 
> >> Thanks,
> >> Michael
> >> 
> >>> On Feb 28, 2015, at 7:47 AM, Arnold Schwaighofer
> >>> <aschwaighofer at apple.com> wrote:
> >>> 
> >>>> <0001-Make-ToVectorTy-static.patch>
> >>> 
> >>> LGTM.
> >>> 
> >>> 
> >>>> <0002-Fix-a-copy-paste-bug.patch>
> >>> 
> >>> LGTM.
> >>> 
> >>>> <0003-TLI-Use-lambda.-NFC.patch>
> >>> 
> >>> LGTM.
> >>> 
> >>>> <0004-LoopVectorizer-Add-TargetTransformInfo.patch>
> >>> 
> >>> LGTM.
> >>> 
> >>>> <0005-TLI-Factor-out-sanitizeFunctionName.-NFC.patch>
> >>> 
> >>> LGTM.
> >>> 
> >>>> <0006-TLI-Add-interface-for-querying-whether-a-function-is.patch>
> >>> 
> >>> LGTM with one little change. AFAICT, we can reuse one lambda function
> >>> (the second-to-last one) ;b. Seriously though, I think this would also
> >>> be easier to read if we factored out the lambdas into static utility
> >>> functions:
> >>> 
> >>> 
> >>>   std::sort(VectorDescs.begin(), VectorDescs.end(), compareByScalarName)
> >>> 
> >>>   std::sort(ScalarDescs.begin(), ScalarDescs.end(), compareByVectorName)
> >>> 
> >>>   std::lower_bound(VectorDescs.begin(), VectorDescs.end(), FunctionName,
> >>>                    compareScalarNameWith)
> >>> 
> >>> etc.
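> >>> 
> >>> i.e. something along these lines (just a sketch, mirroring the bodies
> >>> of the existing lambdas; similarly for the vector-name variants):
> >>> 
> >>>   static bool compareByScalarName(const VecDesc &LHS, const VecDesc &RHS) {
> >>>     return std::strncmp(LHS.ScalarFnName, RHS.ScalarFnName,
> >>>                         std::strlen(RHS.ScalarFnName)) < 0;
> >>>   }
> >>> 
> >>>   static bool compareScalarNameWith(const VecDesc &LHS, StringRef S) {
> >>>     return std::strncmp(LHS.ScalarFnName, S.data(), S.size()) < 0;
> >>>   }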
> >>> 
> >>> +void TargetLibraryInfoImpl::addVectorizableFunctions(ArrayRef<VecDesc> Fns) {
> >>> +  VectorDescs.insert(VectorDescs.end(), Fns.begin(), Fns.end());
> >>> +  std::sort(VectorDescs.begin(), VectorDescs.end(),
> >>> +            [](const VecDesc &LHS, const VecDesc &RHS) {
> >>> +              return std::strncmp(LHS.ScalarFnName, RHS.ScalarFnName,
> >>> +                                  std::strlen(RHS.ScalarFnName)) < 0;
> >>> +            });
> >>> +
> >>> +  ScalarDescs.insert(ScalarDescs.end(), Fns.begin(), Fns.end());
> >>> +  std::sort(ScalarDescs.begin(), ScalarDescs.end(),
> >>> +            [](const VecDesc &LHS, const VecDesc &RHS) {
> >>> +              return std::strncmp(LHS.VectorFnName, RHS.VectorFnName,
> >>> +                                  std::strlen(RHS.VectorFnName)) < 0;
> >>> +            });
> >>> +}
> >>> +
> >>> +bool TargetLibraryInfoImpl::isFunctionVectorizable(StringRef funcName) const {
> >>> +  funcName = sanitizeFunctionName(funcName);
> >>> +  if (funcName.empty())
> >>> +    return false;
> >>> +
> >>> +  std::vector<VecDesc>::const_iterator I = std::lower_bound(
> >>> +      VectorDescs.begin(), VectorDescs.end(), funcName,
> >>> +      [](const VecDesc &LHS, StringRef S) {
> >>> +        return std::strncmp(LHS.ScalarFnName, S.data(), S.size()) < 0;
> >>> +      });
> >>> +  return I != VectorDescs.end();
> >>> +}
> >>> +
> >>> +StringRef TargetLibraryInfoImpl::getVectorizedFunction(StringRef F,
> >>> +                                                       unsigned VF) const {
> >>> +  F = sanitizeFunctionName(F);
> >>> +  if (F.empty())
> >>> +    return F;
> >>> +  std::vector<VecDesc>::const_iterator I = std::lower_bound(
> >>> +      VectorDescs.begin(), VectorDescs.end(), F,
> >>> +      [](const VecDesc &LHS, StringRef S) {
> >>> +        return std::strncmp(LHS.ScalarFnName, S.data(), S.size()) < 0;
> >>> +      });
> >>> +  while (I != VectorDescs.end() && StringRef(I->ScalarFnName) == F) {
> >>> +    if (I->VectorizationFactor == VF)
> >>> +      return I->VectorFnName;
> >>> +    ++I;
> >>> +  }
> >>> +  return StringRef();
> >>> +}
> >>> +
> >>> +StringRef TargetLibraryInfoImpl::getScalarizedFunction(StringRef F,
> >>> +                                                       unsigned &VF) const {
> >>> +  F = sanitizeFunctionName(F);
> >>> +  if (F.empty())
> >>> +    return F;
> >>> +
> >>> +  std::vector<VecDesc>::const_iterator I = std::lower_bound(
> >>> +      ScalarDescs.begin(), ScalarDescs.end(), F,
> >>> +      [](const VecDesc &LHS, StringRef S) {
> >>> +        return std::strncmp(LHS.VectorFnName, S.data(), S.size()) < 0;
> >>> +      });
> >>> +  if (I == ScalarDescs.end() || StringRef(I->VectorFnName) != F)
> >>> +    return StringRef();
> >>> +  VF = I->VectorizationFactor;
> >>> +  return I->ScalarFnName;
> >>> +}
> >>> +
> >>>> <0007-TTI-Add-getCallInstrCost.patch>
> >>> 
> >>> LGTM.
> >>>> <0008-LoopVectorize-teach-loop-vectorizer-to-vectorize-cal.patch>
> >>> LGTM.
> >>>> <0009-TTI-Honour-cost-model-for-estimating-cost-of-vector-.patch>
> >>> LGTM.
> >>>> <0010-TLI-Add-several-entries-to-VectorizableFunctions-tab.patch>
> >>> LGTM.
> >>> 
> >>> Thanks!
> >>> 
> >>> 
> >>> 
> >>> 
> >>>> On Feb 27, 2015, at 4:04 PM, Michael Zolotukhin
> >>>> <mzolotukhin at apple.com> wrote:
> >>>> 
> >>>> Hi Arnold,
> >>>> 
> >>>> These are slightly updated patches 0007 and 0008 - I added some
> >>>> description for getCallInstrCost and refactored the changes in the
> >>>> loop vectorizer. I don’t know if these changes actually look better
> >>>> now, but at least we don’t have duplicated code :)
> >>>> 
> >>>> <0007-TTI-Add-getCallInstrCost.patch>
> >>>> <0008-LoopVectorize-teach-loop-vectorizer-to-vectorize-cal.patch>
> >>>> 
> >>>> What is the best way to proceed with this set of patches?
> >>>> Should I create a separate Phabricator review for each of them?
> >>>> 
> >>>> Also, the 0002 patch is an independent bug fix - is it OK to
> >>>> commit it?
> >>>> 
> >>>> Thanks,
> >>>> Michael
> >>>> 
> >>>>> On Feb 26, 2015, at 4:57 PM, Michael Zolotukhin
> >>>>> <mzolotukhin at apple.com> wrote:
> >>>>> 
> >>>>> Hi Arnold,
> >>>>> 
> >>>>>> On Feb 26, 2015, at 1:49 PM, Arnold Schwaighofer
> >>>>>> <aschwaighofer at apple.com> wrote:
> >>>>>> 
> >>>>>> Hi Michael,
> >>>>>> 
> >>>>>> thanks for following up on this. I think it is important to move
> >>>>>> forward with it.
> >>>>>> 
> >>>>>>> On Feb 26, 2015, at 12:01 PM, Michael Zolotukhin
> >>>>>>> <mzolotukhin at apple.com> wrote:
> >>>>>>> 
> >>>>>>> Hi,
> >>>>>>> 
> >>>>>>> Some time ago James posted a patch for vectorization of calls
> >>>>>>> [1]. Although it seems to be a missing feature in the vectorizer,
> >>>>>>> the patch wasn’t committed, so I decided to revive the discussion.
> >>>>>>> 
> >>>>>>> To begin with, I rebased the patch and split it into several
> >>>>>>> semantically separate parts. The approach used here is pretty
> >>>>>>> natural: TargetLibraryInfo has a mapping from a set of scalar
> >>>>>>> functions to their vector counterparts, TargetTransformInfo
> >>>>>>> consults it when the cost of a call instruction is calculated,
> >>>>>>> and if the vector version is cheaper, the vectorizer emits a call
> >>>>>>> to the vector function.
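> >>>>>>> 
> >>>>>>> For illustration, an entry in that mapping looks roughly like this
> >>>>>>> (just a sketch based on the VecDesc struct used in the patches; the
> >>>>>>> "vceilf" entry is only an example of a vector math library variant,
> >>>>>>> the real tables are in the attached patches):
> >>>>>>> 
> >>>>>>>   struct VecDesc {
> >>>>>>>     const char *ScalarFnName;
> >>>>>>>     const char *VectorFnName;
> >>>>>>>     unsigned VectorizationFactor;
> >>>>>>>   };
> >>>>>>> 
> >>>>>>>   static const VecDesc VecFuncs[] = {
> >>>>>>>     // A 4-wide vector variant of ceilf provided by the library.
> >>>>>>>     {"ceilf", "vceilf", 4},
> >>>>>>>   };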
> >>>>>>> 
> >>>>>>> There are several questions about this patch:
> >>>>>>> 1. Is this approach still viable in general and worth
> >>>>>>> pursuing?
> >>>>>> 
> >>>>>> I believe this is still important. We want to be able to
> >>>>>> vectorize function calls if vectorized versions of math
> >>>>>> library
> >>>>>> functions are available.
> >>>>> Thanks, I’ll continue digging into this then :)
> >>>>>> 
> >>>>>> 
> >>>>>>> 2. The cost model is changed. In theory it should become more
> >>>>>>> accurate, but I haven’t measured the performance impact of this
> >>>>>>> change yet. Do these changes look good in general? If so, do we
> >>>>>>> need to tune some target cost-model parameters, or are these
> >>>>>>> values ones that no one has actually been using yet?
> >>>>>> 
> >>>>>> I think we will have to test the performance impact of this change,
> >>>>>> since we are most likely the first clients. We should test with and
> >>>>>> without the target tuple that enables the vectorized function calls.
> >>>>>> If we don’t see egregious regressions in the run that uses the
> >>>>>> modified cost model but does not have vectorized versions of the
> >>>>>> math library available, I think we can commit this patch set so
> >>>>>> that it gets testing by the community.
> >>>>> That sounds good.
> >>>>>> 
> >>>>>> 
> >>>>>>> 3. What is the best way to populate the scalar-to-vector function
> >>>>>>> map? For now, I populate it when the target triple contains
> >>>>>>> ‘Accelerate’. If we have a lot of other cases, we might need to
> >>>>>>> think of something different here.
> >>>>>> 
> >>>>>> We use the environment substring of the target tuple (in this case
> >>>>>> a quadruple). This is in the spirit of what we already do to
> >>>>>> conditionally turn features on and off in TargetLibraryInfo. I
> >>>>>> think we should address the issue of scalability when we have a
> >>>>>> client for which this becomes an issue.
> >>>>> Sounds good.
> >>>>>> 
> >>>>>> 
> >>>>>>> 
> >>>>>>> Any feedback or ideas are appreciated!
> >>>>>>> 
> >>>>>>> Thanks,
> >>>>>>> Michael
> >>>>>>> 
> >>>>>>> [1]:
> >>>>>>> http://thread.gmane.org/gmane.comp.compilers.llvm.cvs/169736
> >>>>>>> 
> >>>>>>> 
> >>>>>>> <0001-Make-ToVectorTy-static.patch>
> >>>>>>> <0002-Fix-a-copy-paste-bug.patch>
> >>>>>>> <0003-TLI-Use-lambda.-NFC.patch>
> >>>>>>> <0004-LoopVectorizer-Add-TargetTransformInfo.patch>
> >>>>>>> <0005-TLI-Factor-out-sanitizeFunctionName.-NFC.patch>
> >>>>>>> <0006-TLI-Add-interface-for-querying-whether-a-function-is.patch>
> >>>>>>> <0007-TTI-Add-getCallInstrCost.patch>
> >>>>>>> <0008-LoopVectorize-teach-loop-vectorizer-to-vectorize-cal.patch>
> >>>>>>> <0009-TTI-Honour-cost-model-for-estimating-cost-of-vector-.patch>
> >>>>>>> <0010-TLI-Add-several-entries-to-VectorizableFunctions-tab.patch>
> >>>>>> 
> >>>>>> Some comments on the code:
> >>>>>> 
> >>>>>> 
> >>>>>> This function could use some documentation. What are Tys?
> >>>>>> (Argument types).
> >>>>> I’ll add some, thanks.
> >>>>>> 
> >>>>>> +  unsigned getCallInstrCost(Function *F, Type *RetTy, ArrayRef<Type *> Tys,
> >>>>>> +                            const TargetLibraryInfo *TLI) {
> >>>>>> +
> >>>>>> +    // Scalar function calls are always expensive.
> >>>>>> +    if (!RetTy->isVectorTy())
> >>>>>> +      return 10;
> >>>>>> +
> >>>>>> +    // Functions with a vector form are no more expensive than a scalar call.
> >>>>>> +    if (TLI &&
> >>>>>> +        TLI->isFunctionVectorizable(F->getName(),
> >>>>>> +                                    RetTy->getVectorNumElements()))
> >>>>>> +      return 10;
> >>>>>> +
> >>>>>> +    // We have to scalarize this function call. Estimate the cost.
> >>>>>> +    unsigned ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
> >>>>>> +    unsigned ScalarCalls = RetTy->getVectorNumElements();
> >>>>>> +    for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
> >>>>>> +      if (Tys[i]->isVectorTy()) {
> >>>>>> +        ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
> >>>>>> 
> >>>>>> This is not obvious. Why could a function argument’s vector type
> >>>>>> have more elements than the vector return type when vectorizing a
> >>>>>> function, such that we would need more function calls when
> >>>>>> scalarizing?
> >>>>> It’s an attempt to handle ‘unusual’ functions here. Possible
> >>>>> examples would be converts, where the input element size differs
> >>>>> from the output element size, or reductions. I think that for now
> >>>>> we can just assert that we only deal with functions whose operand
> >>>>> types are the same as the result type.
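> >>>>> 
> >>>>> For example, something like this inside getCallInstrCost (a rough
> >>>>> sketch only):
> >>>>> 
> >>>>>   for (Type *Ty : Tys)
> >>>>>     assert((!Ty->isVectorTy() ||
> >>>>>             Ty->getVectorNumElements() == RetTy->getVectorNumElements()) &&
> >>>>>            "Calls with mismatched vector widths are not handled yet");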
> >>>>> 
> >>>>>> 
> >>>>>> +        ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
> >>>>>> +      }
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    return ScalarCalls * 10 + ScalarizationCost;
> >>>>>> +  }
> >>>>>> 
> >>>>>> 
> >>>>>> There seems to be some repetition in the code below. Can we
> >>>>>> refactor that?
> >>>>> After re-examining this code I spotted a strange part (I didn’t
> >>>>> change it in the patches):
> >>>>> 
> >>>>>   Type *Tys[] = {CI->getType()};
> >>>>>   if (VF > 1)
> >>>>>     Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
> >>>>> 
> >>>>> I assume that only works for intrinsics with one argument, and
> >>>>> those are the only cases we handle now. But shouldn’t we guard it
> >>>>> with an assert or try to handle the general case here?
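> >>>>> 
> >>>>> For instance, something like this right before the getDeclaration
> >>>>> call (just a sketch, not tested):
> >>>>> 
> >>>>>   // The declaration is parameterized by the return type only, so any
> >>>>>   // vector argument we pass must have that same vector type.
> >>>>>   for (Value *Arg : Args)
> >>>>>     assert((!Arg->getType()->isVectorTy() || Arg->getType() == Tys[0]) &&
> >>>>>            "Unexpected vector operand type for an overloaded intrinsic");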
> >>>>> 
> >>>>> Thanks,
> >>>>> Michael
> >>>>> 
> >>>>>> 
> >>>>>> +  default:
> >>>>>> +    bool HasScalarOpd = hasVectorInstrinsicScalarOpd(ID, 1);
> >>>>>> +    for (unsigned Part = 0; Part < UF; ++Part) {
> >>>>>> +      SmallVector<Value *, 4> Args;
> >>>>>> +      for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
> >>>>>> +        if (HasScalarOpd && i == 1) {
> >>>>>> +          Args.push_back(CI->getArgOperand(i));
> >>>>>> +          continue;
> >>>>>> +        }
> >>>>>> +        VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
> >>>>>> +        Args.push_back(Arg[Part]);
> >>>>>> +      }
> >>>>>> +      Type *Tys[] = {CI->getType()};
> >>>>>> +      if (VF > 1)
> >>>>>> +        Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
> >>>>>> +
> >>>>>> +      Function *F = Intrinsic::getDeclaration(M, ID, Tys);
> >>>>>> +      Entry[Part] = Builder.CreateCall(F, Args);
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    propagateMetadata(Entry, it);
> >>>>>> +    break;
> >>>>>> +  }
> >>>>>> +  // Otherwise, perform the lib call.
> >>>>>> +} else if (TLI && TLI->isFunctionVectorizable(FnName, VF)) {
> >>>>>> +  // This is a function with a vector form.
> >>>>>> +  StringRef VFnName = TLI->getVectorizedFunction(FnName, VF);
> >>>>>> +  assert(!VFnName.empty());
> >>>>>> +
> >>>>>> +  Function *VectorF = M->getFunction(VFnName);
> >>>>>> +  if (!VectorF) {
> >>>>>> +    // Generate a declaration
> >>>>>> +    FunctionType *FTy = FunctionType::get(RetTy, Tys, false);
> >>>>>> +    VectorF = Function::Create(FTy, Function::ExternalLinkage, VFnName, M);
> >>>>>> +    assert(VectorF);
> >>>>>> +  }
> >>>>>> +
> >>>>>>    for (unsigned Part = 0; Part < UF; ++Part) {
> >>>>>>      SmallVector<Value *, 4> Args;
> >>>>>>      for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
> >>>>>> -      if (HasScalarOpd && i == 1) {
> >>>>>> -        Args.push_back(CI->getArgOperand(i));
> >>>>>> -        continue;
> >>>>>> -      }
> >>>>>>        VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
> >>>>>>        Args.push_back(Arg[Part]);
> >>>>>>      }
> >>>>>> -    Type *Tys[] = {CI->getType()};
> >>>>>> -    if (VF > 1)
> >>>>>> -      Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
> >>>>>> -
> >>>>>> -    Function *F = Intrinsic::getDeclaration(M, ID, Tys);
> >>>>>> -    Entry[Part] = Builder.CreateCall(F, Args);
> >>>>>> +    Entry[Part] = Builder.CreateCall(VectorF, Args);
> >>>>>>    }
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>> 
> >>>>> _______________________________________________
> >>>>> llvm-commits mailing list
> >>>>> llvm-commits at cs.uiuc.edu
> >>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >>>> 
> >>> 
> >> 
> >> 
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



