[llvm] r204884 - [X86][Vectorizer Cost Model] Correct vectorization cost model for v2i64->v2f64

Thu Mar 27 12:27:29 PDT 2014

On Mar 27, 2014, at 10:54 AM, Adam Nemet <anemet at apple.com> wrote:

> 
> On Mar 27, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> 
>> 
>> On Mar 27, 2014, at 8:52 AM, Adam Nemet <anemet at apple.com> wrote:
>> 
>>> 
>>> On Mar 26, 2014, at 5:52 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>> 
>>>> Author: qcolombet
>>>> Date: Wed Mar 26 19:52:16 2014
>>>> New Revision: 204884
>>>> 
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=204884&view=rev
>>>> Log:
>>>> [X86][Vectorizer Cost Model] Correct vectorization cost model for v2i64->v2f64
>>>> and v4i64->v4f64.
>>>> 
>>>> The new costs match what we did for SSE2 and reflect the reality of our codegen.
>>>> 
>>>> <rdar://problem/16381225>
>>>> 
>>>> Added:
>>>>   llvm/trunk/test/Transforms/LoopVectorize/X86/uint64_to_fp64-cost-model.ll
>>>> Modified:
>>>>   llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
>>>> 
>>>> Modified: llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp?rev=204884&r1=204883&r2=204884&view=diff
>>>> ==============================================================================
>>>> --- llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp (original)
>>>> +++ llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp Wed Mar 26 19:52:16 2014
>>>> @@ -512,6 +512,8 @@ unsigned X86TTI::getCastInstrCost(unsign
>>>>    { ISD::UINT_TO_FP,  MVT::v4f64, MVT::v4i8,  2 },
>>>>    { ISD::UINT_TO_FP,  MVT::v4f64, MVT::v4i16, 2 },
>>>>    { ISD::UINT_TO_FP,  MVT::v4f64, MVT::v4i32, 6 },
>>>> +    { ISD::UINT_TO_FP,  MVT::v2f64, MVT::v2i64, 2*10 },
>>>> +    { ISD::UINT_TO_FP,  MVT::v4f64, MVT::v4i64, 4*10 },
>>> 
>>> I think that a comment would be in order here.  If this node is codegen’ed as Expand rather than by the backend then for example why doesn’t the generic cost estimate (BasicTTI) give the right result?
>> BasicTTI estimates the cost of VF 2 to 4, and VF 4 to 10. One can argue that this is a very optimistic computation based on the actual codegen!
>> 
>> In  X86TTI::getCastInstrCost, there is already a comment about that for SSE2:
>>     // These are somewhat magic numbers justified by looking at the output of
>>     // Intel's IACA, running some kernels and making sure when we take
>>     // legalization into account the throughput will be overestimated.
>>     { ISD::UINT_TO_FP, MVT::v2f64, MVT::v2i64, 2*10 },
>> 
>> I’ve talked to Arnold and he said we basically need to put something expensive (10) and multiply it by the vector width.
>> I’ve matched the behavior of SSE2 here.
>> Do you think a comment along those lines would do?
> 
> I think that it would be good to point out where you think the generic code gets the number wrong.
Alright, I’m glad you pushed for a comment, there is indeed a bug in the generic cost computation in BasicTTI!

What is happening is that uitofp i64 to double is considered as Legal (cost 1), whereas it is Custom and it expands in 5~6 instructions.

** The Facts **

X86TargetLowering::resetOperationActions:
 if (Subtarget->is64Bit()) {
    setOperationAction(ISD::UINT_TO_FP     , MVT::i32  , Promote);
    setOperationAction(ISD::UINT_TO_FP     , MVT::i64  , Custom);    // <— Custom, given type: source type.
  }

 BasicTTI::getCastInstrCost:
  // If the cast is marked as legal (or promote) then assume low cost.
  if (TLI->isOperationLegalOrPromote(ISD, DstLT.second))       // <— Given type: destination type. No definition => Legal.
    return 1;

Fixing that would require a bit of work because we can’t simply do the query with the source type. Some cast instructions are defined with the destination type.
E.g.,
  if (Subtarget->is64Bit()) {
    setOperationAction(ISD::FP_TO_UINT     , MVT::i64  , Expand);
    setOperationAction(ISD::FP_TO_UINT     , MVT::i32  , Promote);
  }

** Conclusion **

The proper fix would likely to have a lookup table to know which type should be used for the query. This lookup table should be shared with the setOperationAction method.

I’ll file a PR and add a comment to this commit pointing out that we are working around this PR.

Thanks,
-Quentin
>  Then if we improve the generic estimate we’d know which cost estimates to revisit in the backend.
> 
> As you know I am working on a very similar bug and in my case the estimate for scalarization overhead is optimistic because it does not take the length of the dependence chain into consideration.  I am planning to point this out in the comment.
> 
> Adam
> 
>> 
>> 
>> ** Details on the Expansion Computation **
>> 
>> We are indeed failing in the expand case, which is the related code:
>> 
>>  BasicTargetTransformInfo.cpp:339
>>     // If we are converting vectors and the operation is illegal, or
>>     // if the vectors are legalized to different types, estimate the
>>     // scalarization costs.
>>     unsigned Num = Dst->getVectorNumElements();
>> The cast for the scalar type is legal and returns 1.
>> 
>>     unsigned Cost = TopTTI->getCastInstrCost(Opcode, Dst->getScalarType(),
>>                                              Src->getScalarType());
>> 
>>     // Return the cost of multiple scalar invocation plus the cost of
>>     // inserting and extracting the values.
>> Then the scalarization overhead is estimated to 2 for VF 2 and 6 for VF 4:
>> The cost of Index 0 is free (per X86TTI::getVectorInstrCost) and the cost of other indexes is 2 (1 insert, 1 extract).
>> I.e.,
>> VF N = 0 + 2 * (N - 1) + N * 1 = 3 * N - 2
>> VF 2 = 3 * 2 - 2 = 4
>> VF 4 = 3 * 4 - 2 = 10
>>     return getScalarizationOverhead(Dst, true, true) + Num * Cost;
>> 
>> Thanks,
>> -Quentin
>>> 
>>> Adam
>>> 
>>>> 
>>>>    { ISD::FP_TO_SINT,  MVT::v8i8,  MVT::v8f32, 7 },
>>>>    { ISD::FP_TO_SINT,  MVT::v4i8,  MVT::v4f32, 1 },
>>>> 
>>>> Added: llvm/trunk/test/Transforms/LoopVectorize/X86/uint64_to_fp64-cost-model.ll
>>>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/X86/uint64_to_fp64-cost-model.ll?rev=204884&view=auto
>>>> ==============================================================================
>>>> --- llvm/trunk/test/Transforms/LoopVectorize/X86/uint64_to_fp64-cost-model.ll (added)
>>>> +++ llvm/trunk/test/Transforms/LoopVectorize/X86/uint64_to_fp64-cost-model.ll Wed Mar 26 19:52:16 2014
>>>> @@ -0,0 +1,26 @@
>>>> +; RUN: opt < %s  -loop-vectorize -mtriple=x86_64-apple-macosx10.8.0 -mcpu=corei7-avx -S -debug-only=loop-vectorize 2>&1 | FileCheck %s
>>>> +; REQUIRES: asserts
>>>> +
>>>> +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
>>>> +target triple = "x86_64-apple-macosx10.8.0"
>>>> +
>>>> +
>>>> +; CHECK: cost of 20 for VF 2 For instruction:   %conv = uitofp i64 %tmp to double
>>>> +; CHECK: cost of 40 for VF 4 For instruction:   %conv = uitofp i64 %tmp to double
>>>> +define void @uint64_to_double_cost(i64* noalias nocapture %a, double* noalias nocapture readonly %b) nounwind {
>>>> +entry:
>>>> +  br label %for.body
>>>> +for.body:
>>>> +  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
>>>> +  %arrayidx = getelementptr inbounds i64* %a, i64 %indvars.iv
>>>> +  %tmp = load i64* %arrayidx, align 4
>>>> +  %conv = uitofp i64 %tmp to double
>>>> +  %arrayidx2 = getelementptr inbounds double* %b, i64 %indvars.iv
>>>> +  store double %conv, double* %arrayidx2, align 4
>>>> +  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>>>> +  %exitcond = icmp eq i64 %indvars.iv.next, 256
>>>> +  br i1 %exitcond, label %for.end, label %for.body
>>>> +
>>>> +for.end:
>>>> +  ret void
>>>> +}
>>>> 
>>>> 
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140327/71945ac8/attachment.html>