# RFC: Enable vectorization of call instructions in the loop vectorizer

Michael Zolotukhin mzolotukhin at apple.com
Mon Mar 2 11:38:38 PST 2015

```
Hi Eric,

I guess you're talking about the 0009 patch, where I changed the cost model for intrinsics and adjusted the tests accordingly (please correct me if I'm wrong).

We have several constants in this patch (again, I'm not sure which of them you are talking about, so I'll try to explain all of them):

1. 10 in 'return ScalarCalls * 10 + ScalarizationCost'. The constant 10 is indeed magic here, but this constant is used everywhere as the cost of a call. It might be a good idea to parameterize it, but I'd say that would be an independent change.

2. Concrete costs in the tests, like these:
-; CORE2: Cost Model: Found an estimated cost of 400 for instruction:   %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)
+; CORE2: Cost Model: Found an estimated cost of 46 for instruction:   %2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %wide.load)

The idea of the test is to check that if an intrinsic maps well to a vector instruction, it should be much cheaper than when it has to be scalarized. In the original test the cheap cost is 1 and the expensive cost is 400; in the modified test the cheap cost is 1 and the expensive cost is 46. I think we can replace '46' with the regexp {{[1-9][0-9]+}}, which matches any number of two or more digits, and that would be more general. What do you think?

3. Constants in scalarize.ll tests, e.g. these:
-; CHECK32: cost of 12 {{.*}}bswap.v4i32
-; CHECK64: cost of 12 {{.*}}bswap.v4i32
+; CHECK32: cost of 48 {{.*}}bswap.v4i32
+; CHECK64: cost of 48 {{.*}}bswap.v4i32

This test checks the specific costs, so if we change the cost model, we need to change the constants here. Some constants have changed quite significantly in this test, which is why I especially wanted to hear opinions on the change to the cost model. I still need to measure the performance impact of this change, but if there are no regressions, I'd assume nobody really relies on exactly these values, and the new ones will serve perfectly fine.

It is also worth noting that this is not just a random change in the cost model - the intention is to make it more accurate. We have parameters like ScalarizationOverhead, we know the number of scalar calls, and we know the cost of each scalar call, but we didn't use that information and ended up with an overly conservative cost (10 * CostOfCall * NumOfCalls). With this change the new formula is (CostOfCall * NumOfCalls + ScalarizationCost), which should be pretty accurate, provided that CostOfCall and ScalarizationCost are accurate.

Thanks,
Michael

> On Feb 28, 2015, at 1:14 PM, Eric Christopher <echristo at gmail.com> wrote:
>
>
>
> On Fri, Feb 27, 2015 at 4:10 PM Michael Zolotukhin <mzolotukhin at apple.com <mailto:mzolotukhin at apple.com>> wrote:
> Hi Arnold,
>
> These are slightly updated patches 0007 and 0008 - I added some description for getCallInstrCost and refactored the changes in the loop vectorizer. I don't know if these changes actually look better now, but at least we don't have duplicated code :)
>
>
>
> How'd you get the magic constants?
>
> -eric
>
> What is the best way to proceed with this set of patches? Should I create a separate Phabricator review for each of them?
>
> Also, the 0002 patch is an independent bug fix - is it ok to commit it?
>
> Thanks,
> Michael
>
>> On Feb 26, 2015, at 4:57 PM, Michael Zolotukhin <mzolotukhin at apple.com <mailto:mzolotukhin at apple.com>> wrote:
>>
>> Hi Arnold,
>>
>>> On Feb 26, 2015, at 1:49 PM, Arnold Schwaighofer <aschwaighofer at apple.com <mailto:aschwaighofer at apple.com>> wrote:
>>>
>>> Hi Michael,
>>>
>>> thanks for following up on this. I think it is important to move forward with this.
>>>
>>>> On Feb 26, 2015, at 12:01 PM, Michael Zolotukhin <mzolotukhin at apple.com <mailto:mzolotukhin at apple.com>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Some time ago James posted a patch for vectorization of calls [1]. Though it seems to be a missing feature in the vectorizer, it wasn’t committed, so I decided to revive the discussion.
>>>>
>>>> To begin with, I rebased the patch and split it into several semantically separate parts. The approach used here is pretty natural: TargetLibraryInfo has a mapping from a set of scalar functions to their vector counterparts; TargetTransformInfo consults it when the cost of a call instruction is calculated; and if the vector version is cheaper, the vectorizer emits the vector function.
>>>>
>>>> There are several questions on this patch:
>>>> 1. Is this approach still viable in general and worth pursuing?
>>>
>>> I believe this is still important. We want to be able to vectorize function calls if vectorized versions of math library functions are available.
>> Thanks, I’ll continue digging this then:)
>>>
>>>
>>>> 2. The cost model is changed. In theory it should become more accurate, but I haven't measured the performance impact of this change yet. Do these changes look good in general? If yes, do we need to tune some target cost-model parameters, or are those values that no one has actually been using yet?
>>>
>>> I think we will have to test the performance impact of this change, since we are most likely the first clients. We should test with and without the target tuple that enables the vectorized function calls. If we don't see egregious regressions in the run that uses the modified cost model but does not have vectorized versions of the math library available, I think we can commit this patch set so that it gets testing by the community.
>> That sounds good.
>>>
>>>
>>>> 3. What is the best way to populate the scalar-to-vector function map? For now, I populate it when the target triple contains 'Accelerate'. If we get a lot of other cases, we might think of something different here.
>>>
>>> We use the environment substring of the target tuple (in this case, quadruple). This is in the spirit of what we already do to conditionally turn features on and off in TargetLibraryInfo. I think we should address the issue of scalability when we have a client for which this becomes an issue.
>> Sounds good.
>>>
>>>
>>>>
>>>> Any feedback or ideas are appreciated!
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>>
>>>>
>>>> <0001-Make-ToVectorTy-static.patch>
>>>> <0002-Fix-a-copy-paste-bug.patch>
>>>> <0003-TLI-Use-lambda.-NFC.patch>
>>>> <0005-TLI-Factor-out-sanitizeFunctionName.-NFC.patch>
>>>> <0008-LoopVectorize-teach-loop-vectorizer-to-vectorize-cal.patch>
>>>> <0009-TTI-Honour-cost-model-for-estimating-cost-of-vector-.patch>
>>>
>>> Some comments on the code:
>>>
>>>
>>> This function could use some documentation. What are Tys? (Argument types).
>>>
>>> +  unsigned getCallInstrCost(Function *F, Type *RetTy, ArrayRef<Type *> Tys,
>>> +                            const TargetLibraryInfo *TLI) {
>>> +
>>> +    // Scalar function calls are always expensive.
>>> +    if (!RetTy->isVectorTy())
>>> +      return 10;
>>> +
>>> +    // Functions with a vector form are no more expensive than a scalar call.
>>> +    if (TLI &&
>>> +        TLI->isFunctionVectorizable(F->getName(),
>>> +                                    RetTy->getVectorNumElements()))
>>> +      return 10;
>>> +
>>> +    // We have to scalarize this function call. Estimate the cost.
>>> +    unsigned ScalarizationCost = getScalarizationOverhead(RetTy, true, false);
>>> +    unsigned ScalarCalls = RetTy->getVectorNumElements();
>>> +    for (unsigned i = 0, ie = Tys.size(); i != ie; ++i) {
>>> +      if (Tys[i]->isVectorTy()) {
>>> +        ScalarizationCost += getScalarizationOverhead(Tys[i], false, true);
>>>
>>> This is not obvious. Why could we have more elements in the vector type of a function argument than in the vector return type when vectorizing a function, such that we need more function calls when scalarizing?
>> It's an attempt to handle 'unusual' functions here. Possible examples would be converts, where the input element size differs from the output element size, or reductions. I think that for now we can just assert that we only deal with functions where the operand types are the same as the result type.
>>
>>>
>>> +        ScalarCalls = std::max(ScalarCalls, Tys[i]->getVectorNumElements());
>>> +      }
>>> +    }
>>> +
>>> +    return ScalarCalls * 10 + ScalarizationCost;
>>> +  }
>>>
>>>
>>> There seems to be some repetition in the code below. Can we refactor that?
>> After re-examining this code I spotted a strange part (I didn't change it in the patches):
>>           Type *Tys[] = {CI->getType()};
>>           if (VF > 1)
>>             Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
>> I assume that only works for intrinsics with one argument, and those are the only cases we handle now. But shouldn't we guard it with an assert or try to handle the general case here?
>>
>> Thanks,
>> Michael
>>
>>>
>>> +        default:
>>> +          bool HasScalarOpd = hasVectorInstrinsicScalarOpd(ID, 1);
>>> +          for (unsigned Part = 0; Part < UF; ++Part) {
>>> +            SmallVector<Value *, 4> Args;
>>> +            for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
>>> +              if (HasScalarOpd && i == 1) {
>>> +                Args.push_back(CI->getArgOperand(i));
>>> +                continue;
>>> +              }
>>> +              VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
>>> +              Args.push_back(Arg[Part]);
>>> +            }
>>> +            Type *Tys[] = {CI->getType()};
>>> +            if (VF > 1)
>>> +              Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
>>> +
>>> +            Function *F = Intrinsic::getDeclaration(M, ID, Tys);
>>> +            Entry[Part] = Builder.CreateCall(F, Args);
>>> +          }
>>> +
>>> +          break;
>>> +        }
>>> +        // Otherwise, perform the lib call.
>>> +      } else if (TLI && TLI->isFunctionVectorizable(FnName, VF)) {
>>> +        // This is a function with a vector form.
>>> +        StringRef VFnName = TLI->getVectorizedFunction(FnName, VF);
>>> +        assert(!VFnName.empty());
>>> +
>>> +        Function *VectorF = M->getFunction(VFnName);
>>> +        if (!VectorF) {
>>> +          // Generate a declaration
>>> +          FunctionType *FTy = FunctionType::get(RetTy, Tys, false);
>>> +          VectorF =
>>> +              Function::Create(FTy, Function::ExternalLinkage, VFnName, M);
>>> +          assert(VectorF);
>>> +        }
>>> +
>>>          for (unsigned Part = 0; Part < UF; ++Part) {
>>>            SmallVector<Value *, 4> Args;
>>>            for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
>>> -            if (HasScalarOpd && i == 1) {
>>> -              Args.push_back(CI->getArgOperand(i));
>>> -              continue;
>>> -            }
>>>              VectorParts &Arg = getVectorValue(CI->getArgOperand(i));
>>>              Args.push_back(Arg[Part]);
>>>            }
>>> -          Type *Tys[] = {CI->getType()};
>>> -          if (VF > 1)
>>> -            Tys[0] = VectorType::get(CI->getType()->getScalarType(), VF);
>>> -
>>> -          Function *F = Intrinsic::getDeclaration(M, ID, Tys);
>>> -          Entry[Part] = Builder.CreateCall(F, Args);
>>> +          Entry[Part] = Builder.CreateCall(VectorF, Args);
>>>          }
>>>
>>>
>>>
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits <http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>

```
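The cost-model change discussed in the thread can be sketched in isolation. This is a hypothetical standalone model, not LLVM's actual TargetTransformInfo code: it only shows how the old formula (10 * CostOfCall * NumOfCalls) produces the 400 seen in the original CORE2 test, while the new formula (CostOfCall * NumOfCalls + ScalarizationCost) produces 46 under the assumption that the scalarization overhead for the `<4 x float>` case is 6.

```cpp
// Hypothetical model of the two cost formulas from the thread; the real
// implementation lives in LLVM's TargetTransformInfo. CallCost (10) is the
// generic per-call cost constant discussed above.
static const unsigned CallCost = 10;

// Old formula: 10 * CostOfCall * NumOfCalls (overly conservative).
unsigned oldScalarizedCost(unsigned NumCalls) {
  return 10 * CallCost * NumCalls;
}

// New formula: CostOfCall * NumOfCalls + ScalarizationCost, where the
// scalarization cost accounts for the insert/extract overhead.
unsigned newScalarizedCost(unsigned NumCalls, unsigned ScalarizationCost) {
  return CallCost * NumCalls + ScalarizationCost;
}
```

For a `<4 x float>` intrinsic that scalarizes into 4 calls, the old model charges 400; with an assumed scalarization overhead of 6, the new model charges 46, matching the before/after values in the modified test.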
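The scalar-to-vector function mapping described in the thread can also be sketched as a small lookup table keyed by function name and vectorization factor. This is an illustrative stand-in, not LLVM's actual TargetLibraryInfo interface; the type name `VecLib` and the entry `vsinf` are assumptions for the example, though the query names mirror the `isFunctionVectorizable` / `getVectorizedFunction` calls quoted in the patch.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch of the scalar-to-vector mapping. In the patch this
// table lives in TargetLibraryInfo and is populated when the target
// triple's environment selects a vector math library (e.g. Accelerate).
struct VecLib {
  // Key: (scalar function name, vectorization factor).
  std::map<std::pair<std::string, unsigned>, std::string> Map;

  void add(const std::string &Scalar, unsigned VF, const std::string &Vector) {
    Map[{Scalar, VF}] = Vector;
  }

  // True if a vector counterpart exists for this name at this VF.
  bool isFunctionVectorizable(const std::string &Name, unsigned VF) const {
    return Map.count({Name, VF}) != 0;
  }

  // Returns the vector counterpart's name, or "" if there is none.
  std::string getVectorizedFunction(const std::string &Name,
                                    unsigned VF) const {
    auto It = Map.find({Name, VF});
    return It == Map.end() ? std::string() : It->second;
  }
};
```

Keying on the pair (name, VF) rather than the name alone matches the quoted code's behavior, where a library may offer a counterpart at one vectorization factor but not another.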