<div dir="ltr">Hi Arnold,<div><br></div><div>I've run the patch through LNT and I'm sufficiently satisfied that it doesn't regress performance. The variability in the test-suite makes it difficult to say for certain, however (I used a multi-sample of 5, and there were some regressions in the very short-running tests and some improvements too).</div>
<div><br></div><div>I made the change you asked, and attached are three patches. One for fixing the cost model (I added a scalarization cost here too), one for moving ToVectorTy around and the third is the full change (obviously I'll rebase the third on the first two once they're committed).</div>
<div><br></div><div>Is it now OK to commit?</div><div><br></div><div>Cheers,</div><div><br></div><div>James</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 17 March 2014 16:58, James Molloy <span dir="ltr"><<a href="mailto:James.Molloy@arm.com" target="_blank">James.Molloy@arm.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Arnold,<br>
<br>
I'm running the test suite now. I can create separate patches for the hunks you asked for - the final patch will then rely on those.<br>
<div class=""><br>
> If we remove the factor of 10 in the second hunk we should add a<br>
> scalarization cost, otherwise we would just be estimating the cost of the<br>
> scalar calls.<br>
<br>
</div>The "Cost" variable already has the scalarization cost of extracting the parameter to the scalar call from the vector value - the only thing that is missing is the cost of inserting all the scalar return values into a vector. I'll add that. With the "10 *", it was double counting cost and was ending up at 400 for vectorizing llvm.exp.f32 on X86... which is about 10 times the cost of vectorizing the libcall to expf()! So a test was failing.<br>
<br>
Cheers,<br>
<br>
James<br>
<div class=""><br>
> -----Original Message-----<br>
> From: Arnold Schwaighofer [mailto:<a href="mailto:aschwaighofer@apple.com">aschwaighofer@apple.com</a>]<br>
> Sent: 17 March 2014 16:51<br>
> To: James Molloy<br>
> Cc: Renato Golin; James Molloy; llvm-commits<br>
> Subject: Re: RFC: Enable vectorization of call instructions in the loop<br>
> vectorizer<br>
><br>
</div><div><div class="h5">> Overall this looks great. I have some comments below. Did you run the test<br>
> suite and made sure that no changes are observed?<br>
><br>
><br>
> @@ -434,7 +438,7 @@<br>
> for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {<br>
> if (Tys[i]->isVectorTy()) {<br>
> ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);<br>
> - ScalarCalls = std::max(ScalarCalls, RetTy->getVectorNumElements());<br>
> + ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());<br>
> }<br>
> }<br>
><br>
> @@ -493,13 +497,40 @@<br>
> unsigned Num = RetTy->getVectorNumElements();<br>
> unsigned Cost = TopTTI->getIntrinsicInstrCost(IID, RetTy->getScalarType(),<br>
> Tys);<br>
> - return 10 * Cost * Num;<br>
> + return Cost * Num;<br>
> }<br>
><br>
> // This is going to be turned into a library call, make it expensive.<br>
> return 10;<br>
> }<br>
><br>
><br>
> These two changes should go in as a separate patch. They are fixes to the<br>
> cost model.<br>
><br>
> If we remove the factor of 10 in the second hunk we should add a<br>
> scalarization cost, otherwise we would just be estimating the cost of the<br>
> scalar calls.<br>
><br>
><br>
><br>
> @@ -829,11 +833,6 @@<br>
> /// width. Vector width of one means scalar.<br>
> unsigned getInstructionCost(Instruction *I, unsigned VF);<br>
><br>
> - /// A helper function for converting Scalar types to vector types.<br>
> - /// If the incoming type is void, we return void. If the VF is 1, we return<br>
> - /// the scalar type.<br>
> - static Type* ToVectorTy(Type *Scalar, unsigned VF);<br>
> -<br>
> /// Returns whether the instruction is a load or store and will be a emitted<br>
> /// as a vector operation.<br>
> bool isConsecutiveLoadOrStore(Instruction *I);<br>
><br>
> @@ -1224,6 +1223,15 @@<br>
> return SE->getSCEV(Ptr);<br>
> }<br>
><br>
> +/// A helper function for converting Scalar types to vector types.<br>
> +/// If the incoming type is void, we return void. If the VF is 1, we return<br>
> +/// the scalar type.<br>
> +static Type* ToVectorTy(Type *Scalar, unsigned VF) {<br>
> + if (Scalar->isVoidTy() || VF == 1)<br>
> + return Scalar;<br>
> + return VectorType::get(Scalar, VF);<br>
> +}<br>
> +<br>
><br>
> This is cleanup and should be split into a separate patch.<br>
><br>
><br>
> Thanks for working on this.<br>
><br>
><br>
> On Mar 17, 2014, at 7:38 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk">james@jamesmolloy.co.uk</a>><br>
> wrote:<br>
><br>
> > Hi Arnold,<br>
> ><br>
> > Sorry for the large delay in this - I've been working on this in my spare time<br>
> and haven't had much of that lately! :)<br>
> ><br>
> > This version of the patch:<br>
> ><br>
> > * Addresses your three points in your previous email.<br>
> > * Adds support for the Accelerate library, but I only added support for one<br>
> function in it (expf) for testing purposes. There is a fixme for someone with<br>
> more Apple knowledge and ability to test than me to fill in the rest.<br>
> > * Updates to ToT and updates TargetLibraryInfo to use C++11 lambdas in<br>
> std::lower_bound rather than functors.<br>
> ><br>
> > Does it look better?<br>
> ><br>
> > Cheers,<br>
> ><br>
> > James<br>
> ><br>
> ><br>
> > On 17 January 2014 17:22, James Molloy <<a href="mailto:james@jamesmolloy.co.uk">james@jamesmolloy.co.uk</a>><br>
> wrote:<br>
> > Awesome, thanks Arnold! Very clear now.<br>
> ><br>
> ><br>
> > On 17 January 2014 16:45, Arnold Schwaighofer<br>
> <<a href="mailto:aschwaighofer@apple.com">aschwaighofer@apple.com</a>> wrote:<br>
> ><br>
> > On Jan 17, 2014, at 2:59 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk">james@jamesmolloy.co.uk</a>><br>
> wrote:<br>
> ><br>
> > > Hi Arnold,<br>
> > ><br>
> > > > First, we are going to have the situation where there exists an intrinsic<br>
> ID for a library function (many math library functions have an intrinsic version:<br>
> expf -> llvm.exp.32 for example). As a consequence "getIntrinsicIDForCall"<br>
> will return it. In this case we can have both: a vectorized library function<br>
> version and an intrinsic function that maybe slower or faster. In such a case<br>
> the cost model has to decide which one to pick. This means we have to query<br>
> the cost model which one is cheaper in two places: when get the instruction<br>
> cost and when we vectorize the call.<br>
> > ><br>
> > > Sure, I will address this.<br>
> > ><br>
> > > > Second, the way we test this. [snip]<br>
> > ><br>
> > > This is very sensible. The only reason I didn't go down this route to start<br>
> with was that I didn't know of an available library (like Accelerate) and didn't<br>
> want to add testing/dummy code in tree. Thanks for pointing me at<br>
> Accelerate - that'll give me a real library to (semi) implement and test.<br>
> > ><br>
> > > > This brings me to issue three. You are currently using TTI->getCallCost()<br>
> which is not meant to be used with the vectorizer. We should create a<br>
> getCallInstrCost() function similar to the "getIntrinsicInstrCost" function we<br>
> already have.<br>
> > > ><br>
> > > > BasicTTI::getCallInstrCost should query TLI->isFunctionVectorizable()<br>
> and return a sensible value in this case (one that is lower than a scalarized<br>
> intrinsic lowered as lib call).<br>
> > ><br>
> > > I don't understand the difference between getIntrinsicCost and<br>
> getIntrinsicInstrCost. They both take the same arguments (but return<br>
> different values), and the doxygen docstring does not describe the action in<br>
> enough detail to discern what the required behaviour is.<br>
> > ><br>
> > > Could you please tell me? (and I'll update the docstrings while I'm at it).<br>
> ><br>
> > Sure, TargetTransformInfo is split into two "cost" metrics:<br>
> ><br>
> > * Generic target information which returns its cost in terms of<br>
> "TargetCostConstants":<br>
> ><br>
> > /// \name Generic Target Information<br>
> > /// @{<br>
> ><br>
> > /// \brief Underlying constants for 'cost' values in this interface.<br>
> > ///<br>
> > /// Many APIs in this interface return a cost. This enum defines the<br>
> > /// fundamental values that should be used to interpret (and produce)<br>
> those<br>
> > /// costs. The costs are returned as an unsigned rather than a member of<br>
> this<br>
> > /// enumeration because it is expected that the cost of one IR instruction<br>
> > /// may have a multiplicative factor to it or otherwise won't fit directly<br>
> > /// into the enum. Moreover, it is common to sum or average costs which<br>
> works<br>
> > /// better as simple integral values. Thus this enum only provides<br>
> constants.<br>
</div></div>> > ...<br>
<div class="">> > /// @}<br>
> ><br>
> > This api is used by the inliner (getUserCost) to estimate the cost (size) of<br>
> instructions.<br>
> ><br>
> > * Throughput estimate for the vectorizer. This api attempts to estimate<br>
> (very crudely on a instruction per instruction basis) the throughput of<br>
> instructions (since we automatically infer most values using<br>
> TargetLoweringInfo, and we have to do this from IR this is not going to be<br>
</div>> very accurate ...).<br>
<div class="">> ><br>
> > /// \name Vector Target Information<br>
> > /// @{<br>
> > ...<br>
> > /// \return The expected cost of arithmetic ops, such as mul, xor, fsub, et<br>
> > virtual unsigned getArithmeticInstrCost(unsigned Opcode, Type *Ty,<br>
> > ...<br>
> > /// @}<br>
> ><br>
> > At a high level, this api tries to answer the question: What does this<br>
> instruction cost in a scalar form ("expf", f32). Or what does this instruction<br>
> cost in a vectorized form ("expf", <4 x float>).<br>
> ><br>
> > BasicTTI::getIntrinsicInstrCost() assumes a cost of 1 for intrinsics that have a<br>
> corresponding ISA instruction (TLoweringI-<br>
> >isOperationLegalOrPromote(ISD:FEXP) returns true), a cost of 10 for the<br>
> ones that don't and then we also incorporate things like type legalization<br>
> costs, and overhead if we vectorize.<br>
> ><br>
> > For the new BasicTTI::getCallInstrCost(Function, RetTy, ArgTys) we would<br>
> also return 10 for scalar versions of the function (RetTy->isVectorTy() ==<br>
> false).<br>
> > For vector queries (RetTy->isVectorTy()==true), if there is a a TLibInfo-<br>
> >isVectorizableFunction(Function->getCalledFunction->getName(), RetTy-<br>
> >getVectorNumElements()) we should also return 10. Otherwise, we<br>
> estimate the cost of scalarization just like we do in getIntrinsicInstrCost. This<br>
> will guarantee that the vectorize library function call (Cost = 10) will be<br>
> chosen over the intrinsic lowered to a sequence of scalarized lib calls (Cost =<br>
</div>> 10 * VF * ...).<br>
<div class="im HOEnZb">> ><br>
> > Then, in LoopVectorizationCostModel::getInstructionCost() you would<br>
> query both (if getInstrinsicIDForCall returns an id) apis and return the<br>
> smallest:<br>
> ><br>
> > case Call:<br>
> > CallInst *CI = cast<CallInst>(I);<br>
> ><br>
> ><br>
> > Type *RetTy = ToVectorTy(CI->getType(), VF);<br>
> > SmallVector<Type*, 4> Tys;<br>
> > for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i)<br>
> > Tys.push_back(ToVectorTy(CI->getArgOperand(i)->getType(), VF));<br>
> > unsigned LibFuncCallCost = TTI.getCallInstrCost(CI->getCalledFunction(),<br>
> RetTy, Tys);<br>
> ><br>
> > if (unsigned ID = getIntrinsicIDForCall(CI, TLI))<br>
> > return std::min(LibFuncCallCost, TTI.getIntrinsicInstrCost(ID, RetTy,<br>
> Tys));<br>
> > return LibFuncCallCost;<br>
> ><br>
> ><br>
> > Thanks,<br>
> > Arnold<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > <vectorize-calls.diff><br>
><br>
<br>
<br>
</div><div class="HOEnZb"><div class="h5">-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.<br>
<br>
ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590<br>
ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782<br>
<br>
</div></div></blockquote></div><br></div>