[llvm] r176403 - X86 cost model: Adjust cost for custom lowered vector multiplies

Sat Mar 2 06:08:50 PST 2013

On Mar 2, 2013, at 3:41 AM, Benjamin Kramer <benny.kra at gmail.com> wrote:
> 
> Thanks for doing this, the matrix multiply benchmark was embarrassing.
> 
> If we look at the graph for one of the matrix shootout benchmarks (the bots use -march=core2):
> http://llvm.org/perf/db_default/v4/nts/graph?plot.0=7.64.2&highlight_run=9046
> 
> - We start with the unvectorized loop as our baseline, then we started vectorizing (slowdown with two little peaks at the beginning). At that point we didn't have the pmuludq lowering and got slow scalarized multiplies.
> - The drop in the middle is when I added pmuludq lowering, note that at this time the vectorized code was faster than the scalar version.
> - Then we got slower again, I forgot if this was unrolling or some unrelated cost tweak.

I suspect this was the change that started taking pointers into account when determining the widest type.

http://llvm.org/viewvc/llvm-project?view=revision&revision=174377

We used to ignore pointers: the widest type in matrixmult would have been i32. Now, because the load of a pointer to the matrix is included, it is i64. This would have dropped the vectorization factor from 4 to 2. One could force a vector width of 4 and see whether the drop is reproducible this way.

At a vector width of 4 we would vectorize the multiplies to v4i32 muls. They should stay like this during lowering because v4i32 is a legal type.

At a vector width of 2 we would promote v2i32 to v2i64. Now we end up with the long 9-instruction i64 vector multiply. :(
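
To make the arithmetic concrete, here is a minimal sketch of how the widest type could determine the vector width. It assumes 128-bit SSE registers (as on core2); vectorFactor and kSSERegisterBits are hypothetical names for illustration, not the loop vectorizer's actual code.

```cpp
// Hypothetical helper, not LLVM's actual code: the vectorization factor is
// the vector register width divided by the widest type seen in the loop,
// clamped to at least 1. Assumes widestTypeBits is nonzero.
constexpr unsigned vectorFactor(unsigned registerBits, unsigned widestTypeBits) {
    return registerBits / widestTypeBits < 1 ? 1 : registerBits / widestTypeBits;
}

// Assumption: core2 only has 128-bit XMM registers.
constexpr unsigned kSSERegisterBits = 128;

// Pointers ignored: the widest type in matrixmult is i32, so four lanes fit.
static_assert(vectorFactor(kSSERegisterBits, 32) == 4, "four i32 lanes");

// Pointer load counted: the widest type becomes i64, so only two lanes fit.
static_assert(vectorFactor(kSSERegisterBits, 64) == 2, "two i64 lanes");
```

With the factor at 2, the i32 multiplies become v2i32, which is not a legal type and is promoted to v2i64.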

It is on my list to investigate this: recognize cases where we actually had a v2i32 multiply that the type legalizer promoted to a v2i64 multiply, and emit better code (in this case we don't care about the upper bits). But for now I am trying to root out as many bad regressions as I can. If somebody wants to beat me to it, I am a happy camper :).
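
For reference, a scalar sketch of what one 64-bit lane of that multiply computes, and why the promoted case could be cheaper. This models the arithmetic only; the function names are hypothetical, and it assumes pmuludq multiplies the low 32 bits of each 64-bit lane into a full 64-bit product.

```cpp
#include <cstdint>

// Models one 64-bit lane of pmuludq: multiply the low 32 bits of each
// operand and keep the full 64-bit product.
constexpr std::uint64_t pmuludqLane(std::uint64_t a, std::uint64_t b) {
    return (a & 0xffffffffu) * (b & 0xffffffffu);
}

// One lane of the long i64 multiply: 3 multiplies, 4 shifts, 2 adds.
// The hi(a) * hi(b) term is dropped because it only affects bits >= 64.
constexpr std::uint64_t mulLaneLong(std::uint64_t a, std::uint64_t b) {
    return pmuludqLane(a, b)                   // lo(a) * lo(b)
         + (pmuludqLane(a, b >> 32) << 32)     // lo(a) * hi(b), shifted up
         + (pmuludqLane(a >> 32, b) << 32);    // hi(a) * lo(b), shifted up
}

// One lane when both operands were promoted from i32 and only the low
// 32 bits of the result are used: a single multiply is enough, whatever
// the upper 32 bits of the inputs contain.
constexpr std::uint32_t mulLanePromoted(std::uint64_t a, std::uint64_t b) {
    return static_cast<std::uint32_t>(pmuludqLane(a, b));
}
```

If this model holds, the promoted v2i32 case would need one pmuludq per vector instead of the nine-instruction sequence.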

Best,
Arnold

> - Now we're back at the scalar code.
> 
> I don't know how representative those benchmarks are for real-world code, but it might be worth looking at what happened in the middle and trying to reproduce it. OTOH it could be a microarchitectural glitch and not be worth the hassle.
> 
> - Ben
>> 
>> radar://13304919
>> 
>> Modified:
>> llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
>> llvm/trunk/test/Analysis/CostModel/X86/arith.ll
>> llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll
>> 
>> Modified: llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp?rev=176403&r1=176402&r2=176403&view=diff
>> ==============================================================================
>> --- llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp (original)
>> +++ llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp Fri Mar  1 22:02:52 2013
>> @@ -176,18 +176,42 @@ unsigned X86TTI::getArithmeticInstrCost(
>>  { ISD::MUL,     MVT::v8i32,    4 },
>>  { ISD::SUB,     MVT::v8i32,    4 },
>>  { ISD::ADD,     MVT::v8i32,    4 },
>> -    { ISD::MUL,     MVT::v4i64,    4 },
>>  { ISD::SUB,     MVT::v4i64,    4 },
>>  { ISD::ADD,     MVT::v4i64,    4 },
>> -    };
>> +    // A v4i64 multiply is custom lowered as two split v2i64 vectors that then
>> +    // are lowered as a series of long multiplies(3), shifts(4) and adds(2)
>> +    // Because we believe v4i64 to be a legal type, we must also include the
>> +    // split factor of two in the cost table. Therefore, the cost here is 18
>> +    // instead of 9.
>> +    { ISD::MUL,     MVT::v4i64,    18 },
>> +  };
>> 
>> // Look for AVX1 lowering tricks.
>> -  if (ST->hasAVX()) {
>> -    int Idx = CostTableLookup<MVT>(AVX1CostTable, array_lengthof(AVX1CostTable), ISD,
>> -                          LT.second);
>> +  if (ST->hasAVX() && !ST->hasAVX2()) {
>> +    int Idx = CostTableLookup<MVT>(AVX1CostTable, array_lengthof(AVX1CostTable),
>> +                                   ISD, LT.second);
>>  if (Idx != -1)
>>    return LT.first * AVX1CostTable[Idx].Cost;
>> }
>> +
>> +  // Custom lowering of vectors.
>> +  static const CostTblEntry<MVT> CustomLowered[] = {
>> +    // A v2i64/v4i64 and multiply is custom lowered as a series of long
>> +    // multiplies(3), shifts(4) and adds(2).
>> +    { ISD::MUL,     MVT::v2i64,    9 },
>> +    { ISD::MUL,     MVT::v4i64,    9 },
>> +  };
>> +  int Idx = CostTableLookup<MVT>(CustomLowered, array_lengthof(CustomLowered),
>> +                                 ISD, LT.second);
>> +  if (Idx != -1)
>> +    return LT.first * CustomLowered[Idx].Cost;
>> +
>> +  // Special lowering of v4i32 mul on sse2, sse3: Lower v4i32 mul as 2x shuffle,
>> +  // 2x pmuludq, 2x shuffle.
>> +  if (ISD == ISD::MUL && LT.second == MVT::v4i32 && ST->hasSSE2() &&
>> +      !ST->hasSSE41())
>> +    return 6;
>> +
>> // Fallback to the default implementation.
>> return TargetTransformInfo::getArithmeticInstrCost(Opcode, Ty);
>> }
>> 
>> Modified: llvm/trunk/test/Analysis/CostModel/X86/arith.ll
>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/arith.ll?rev=176403&r1=176402&r2=176403&view=diff
>> ==============================================================================
>> --- llvm/trunk/test/Analysis/CostModel/X86/arith.ll (original)
>> +++ llvm/trunk/test/Analysis/CostModel/X86/arith.ll Fri Mar  1 22:02:52 2013
>> @@ -1,4 +1,6 @@
>> ; RUN: opt < %s  -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mcpu=corei7-avx | FileCheck %s
>> +; RUN: opt < %s  -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mcpu=core2 | FileCheck %s --check-prefix=SSE3
>> +; RUN: opt < %s  -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mcpu=core-avx2 | FileCheck %s --check-prefix=AVX2
>> 
>> target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
>> target triple = "x86_64-apple-macosx10.8.0"
>> @@ -32,7 +34,37 @@ define i32 @xor(i32 %arg) {
>> ret i32 undef
>> }
>> 
>> +; CHECK: mul
>> +define void @mul() {
>> +  ; A <2 x i32> gets expanded to a <2 x i64> vector.
>> +  ; A <2 x i64> vector multiply is implemented using
>> +  ; 3 PMULUDQ and 2 PADDS and 4 shifts.
>> +  ;CHECK: cost of 9 {{.*}} mul
>> +  %A0 = mul <2 x i32> undef, undef
>> +  ;CHECK: cost of 9 {{.*}} mul
>> +  %A1 = mul <2 x i64> undef, undef
>> +  ;CHECK: cost of 18 {{.*}} mul
>> +  %A2 = mul <4 x i64> undef, undef
>> +  ret void
>> +}
>> +
>> +; SSE3: sse3mull
>> +define void @sse3mull() {
>> +  ; SSE3: cost of 6 {{.*}} mul
>> +  %A0 = mul <4 x i32> undef, undef
>> +  ret void
>> +  ; SSE3: avx2mull
>> +}
>> +
>> +; AVX2: avx2mull
>> +define void @avx2mull() {
>> +  ; AVX2: cost of 9 {{.*}} mul
>> +  %A0 = mul <4 x i64> undef, undef
>> +  ret void
>> +  ; AVX2: fmul
>> +}
>> 
>> +; CHECK: fmul
>> define i32 @fmul(i32 %arg) {
>> ;CHECK: cost of 1 {{.*}} fmul
>> %A = fmul <4 x float> undef, undef
>> 
>> Modified: llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll
>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll?rev=176403&r1=176402&r2=176403&view=diff
>> ==============================================================================
>> --- llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll (original)
>> +++ llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll Fri Mar  1 22:02:52 2013
>> @@ -27,7 +27,7 @@ define i32 @read_mod_write_single_ptr(fl
>> 
>> 
>> ;CHECK: @read_mod_i64
>> -;CHECK: load <4 x i64>
>> +;CHECK: load <2 x i64>
>> ;CHECK: ret i32
>> define i32 @read_mod_i64(i64* nocapture %a, i32 %n) nounwind uwtable ssp {
>> %1 = icmp sgt i32 %n, 0
>> @@ -37,7 +37,7 @@ define i32 @read_mod_i64(i64* nocapture
>> %indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]
>> %2 = getelementptr inbounds i64* %a, i64 %indvars.iv
>> %3 = load i64* %2, align 4
>> -  %4 = mul i64 %3, 3
>> +  %4 = add i64 %3, 3
>> store i64 %4, i64* %2, align 4
>> %indvars.iv.next = add i64 %indvars.iv, 1
>> %lftr.wideiv = trunc i64 %indvars.iv.next to i32
>> 
>> 