[llvm] r176403 - X86 cost model: Adjust cost for custom lowered vector multiplies

Benjamin Kramer benny.kra at gmail.com
Sat Mar 2 01:41:27 PST 2013


On 02.03.2013, at 05:02, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:

> Author: arnolds
> Date: Fri Mar  1 22:02:52 2013
> New Revision: 176403
> 
> URL: http://llvm.org/viewvc/llvm-project?rev=176403&view=rev
> Log:
> X86 cost model: Adjust cost for custom lowered vector multiplies
> 
> This matters for example in following matrix multiply:
> 
> int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
>  int i, j, k, val;
>  for (i=0; i<rows; i++) {
>    for (j=0; j<cols; j++) {
>      val = 0;
>      for (k=0; k<cols; k++) {
>        val += m1[i][k] * m2[k][j];
>      }
>      m3[i][j] = val;
>    }
>  }
>  return(m3);
> }
> 
> Taken from the test-suite benchmark Shootout.
> 
> We estimated the cost of the multiply to be 2, while we actually generate 9
> instructions for it, and end up being quite a bit slower than the scalar
> version (48% on my machine).
> 
> Also, properly differentiate between AVX1 and AVX2. On AVX1 we still split the
> vector into two 128-bit halves and handle each subvector mul as above with 9
> instructions, so the total cost is 18.
> Only on AVX2 will we have a cost of 9 for v4i64.
> 
> I changed the test case in test/Transforms/LoopVectorize/X86/avx1.ll to use an
> add instead of a mul because with a mul we now no longer vectorize. I did
> verify that the mul would indeed be more expensive when vectorized, using
> three kernels:
> 
> for (i ...)
>   r += a[i] * 3;
> for (i ...)
>  m1[i] = m1[i] * 3; // This matches the test case in avx1.ll
> and a matrix multiply.
> 
> In each case the vectorized version was considerably slower.

Thanks for doing this; the matrix multiply benchmark was embarrassing.

If we look at the graph for one of the matrix Shootout benchmarks (the bots use -march=core2):
http://llvm.org/perf/db_default/v4/nts/graph?plot.0=7.64.2&highlight_run=9046

- We started with the unvectorized loop as our baseline, then we started vectorizing (the slowdown with two little peaks at the beginning). At that point we didn't have the pmuludq lowering and got slow scalarized multiplies.
- The drop in the middle is when I added the pmuludq lowering; note that at that time the vectorized code was faster than the scalar version.
- Then we got slower again; I forget whether this was unrolling or some unrelated cost tweak.
- Now we're back at the scalar code.

I don't know how representative those benchmarks are of real-world code, but it might be worth looking at what happened in the middle and trying to reproduce it. OTOH it could just be a microarchitectural glitch and not worth the hassle.

- Ben
> 
> radar://13304919
> 
> Modified:
>    llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
>    llvm/trunk/test/Analysis/CostModel/X86/arith.ll
>    llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll
> 
> Modified: llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp?rev=176403&r1=176402&r2=176403&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp (original)
> +++ llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp Fri Mar  1 22:02:52 2013
> @@ -176,18 +176,42 @@ unsigned X86TTI::getArithmeticInstrCost(
>     { ISD::MUL,     MVT::v8i32,    4 },
>     { ISD::SUB,     MVT::v8i32,    4 },
>     { ISD::ADD,     MVT::v8i32,    4 },
> -    { ISD::MUL,     MVT::v4i64,    4 },
>     { ISD::SUB,     MVT::v4i64,    4 },
>     { ISD::ADD,     MVT::v4i64,    4 },
> -    };
> +    // A v4i64 multiply is custom lowered as two split v2i64 vectors that then
> +    // are lowered as a series of long multiplies(3), shifts(4) and adds(2)
> +    // Because we believe v4i64 to be a legal type, we must also include the
> +    // split factor of two in the cost table. Therefore, the cost here is 18
> +    // instead of 9.
> +    { ISD::MUL,     MVT::v4i64,    18 },
> +  };
> 
>   // Look for AVX1 lowering tricks.
> -  if (ST->hasAVX()) {
> -    int Idx = CostTableLookup<MVT>(AVX1CostTable, array_lengthof(AVX1CostTable), ISD,
> -                          LT.second);
> +  if (ST->hasAVX() && !ST->hasAVX2()) {
> +    int Idx = CostTableLookup<MVT>(AVX1CostTable, array_lengthof(AVX1CostTable),
> +                                   ISD, LT.second);
>     if (Idx != -1)
>       return LT.first * AVX1CostTable[Idx].Cost;
>   }
> +
> +  // Custom lowering of vectors.
> +  static const CostTblEntry<MVT> CustomLowered[] = {
> +    // A v2i64/v4i64 and multiply is custom lowered as a series of long
> +    // multiplies(3), shifts(4) and adds(2).
> +    { ISD::MUL,     MVT::v2i64,    9 },
> +    { ISD::MUL,     MVT::v4i64,    9 },
> +  };
> +  int Idx = CostTableLookup<MVT>(CustomLowered, array_lengthof(CustomLowered),
> +                                 ISD, LT.second);
> +  if (Idx != -1)
> +    return LT.first * CustomLowered[Idx].Cost;
> +
> +  // Special lowering of v4i32 mul on sse2, sse3: Lower v4i32 mul as 2x shuffle,
> +  // 2x pmuludq, 2x shuffle.
> +  if (ISD == ISD::MUL && LT.second == MVT::v4i32 && ST->hasSSE2() &&
> +      !ST->hasSSE41())
> +    return 6;
> +
>   // Fallback to the default implementation.
>   return TargetTransformInfo::getArithmeticInstrCost(Opcode, Ty);
> }
> 
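As an aside, for anyone reading the cost numbers: the 9-instruction sequence
for a single v2i64 multiply corresponds roughly to the sketch below, written
with SSE2 intrinsics (the helper name is mine, just for illustration, not
something from the patch). It is 3 pmuludq, 4 shifts and 2 adds; on AVX1 a
v4i64 multiply is split into two such halves, which is where the 18 comes from.

#include <emmintrin.h>

/* Illustrative sketch of the custom lowering for a <2 x i64> multiply on
   SSE2/AVX1 (hypothetical helper, not from the patch):
   3 pmuludq, 4 shifts, 2 adds. */
static __m128i mul_v2i64_sse2(__m128i a, __m128i b) {
  __m128i a_hi  = _mm_srli_epi64(a, 32);      /* shift 1: hi(a)        */
  __m128i b_hi  = _mm_srli_epi64(b, 32);      /* shift 2: hi(b)        */
  __m128i lo_lo = _mm_mul_epu32(a, b);        /* pmuludq 1: lo(a)*lo(b) */
  __m128i lo_hi = _mm_mul_epu32(a, b_hi);     /* pmuludq 2: lo(a)*hi(b) */
  __m128i hi_lo = _mm_mul_epu32(a_hi, b);     /* pmuludq 3: hi(a)*lo(b) */
  lo_hi = _mm_slli_epi64(lo_hi, 32);          /* shift 3               */
  hi_lo = _mm_slli_epi64(hi_lo, 32);          /* shift 4               */
  __m128i sum = _mm_add_epi64(lo_lo, lo_hi);  /* add 1                 */
  return _mm_add_epi64(sum, hi_lo);           /* add 2                 */
}

On AVX2 the same recipe applies to a whole <4 x i64> at once using the 256-bit
forms (vpmuludq, vpsllq/vpsrlq, vpaddq), which is why the cost there stays at 9.
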
> Modified: llvm/trunk/test/Analysis/CostModel/X86/arith.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/arith.ll?rev=176403&r1=176402&r2=176403&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/arith.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/arith.ll Fri Mar  1 22:02:52 2013
> @@ -1,4 +1,6 @@
> ; RUN: opt < %s  -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mcpu=corei7-avx | FileCheck %s
> +; RUN: opt < %s  -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mcpu=core2 | FileCheck %s --check-prefix=SSE3
> +; RUN: opt < %s  -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mcpu=core-avx2 | FileCheck %s --check-prefix=AVX2
> 
> target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> target triple = "x86_64-apple-macosx10.8.0"
> @@ -32,7 +34,37 @@ define i32 @xor(i32 %arg) {
>   ret i32 undef
> }
> 
> +; CHECK: mul
> +define void @mul() {
> +  ; A <2 x i32> gets expanded to a <2 x i64> vector.
> +  ; A <2 x i64> vector multiply is implemented using
> +  ; 3 PMULUDQ and 2 PADDS and 4 shifts.
> +  ;CHECK: cost of 9 {{.*}} mul
> +  %A0 = mul <2 x i32> undef, undef
> +  ;CHECK: cost of 9 {{.*}} mul
> +  %A1 = mul <2 x i64> undef, undef
> +  ;CHECK: cost of 18 {{.*}} mul
> +  %A2 = mul <4 x i64> undef, undef
> +  ret void
> +}
> +
> +; SSE3: sse3mull
> +define void @sse3mull() {
> +  ; SSE3: cost of 6 {{.*}} mul
> +  %A0 = mul <4 x i32> undef, undef
> +  ret void
> +  ; SSE3: avx2mull
> +}
> +
> +; AVX2: avx2mull
> +define void @avx2mull() {
> +  ; AVX2: cost of 9 {{.*}} mul
> +  %A0 = mul <4 x i64> undef, undef
> +  ret void
> +  ; AVX2: fmul
> +}
> 
> +; CHECK: fmul
> define i32 @fmul(i32 %arg) {
>   ;CHECK: cost of 1 {{.*}} fmul
>   %A = fmul <4 x float> undef, undef
> 
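For the core2/SSE3 case that the new test checks, the cost of 6 for a
<4 x i32> multiply matches a sequence along these lines (again just an
illustrative sketch with a made-up helper name; the actual shuffles the
backend picks may differ):

#include <emmintrin.h>

/* Illustrative pre-SSE4.1 lowering of a <4 x i32> multiply (hypothetical
   helper, not from the patch): 2 shuffles, 2 pmuludq, 2 shuffles = 6. */
static __m128i mul_v4i32_sse2(__m128i a, __m128i b) {
  __m128i a_odd = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 3, 1, 1)); /* shuffle 1 */
  __m128i b_odd = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 3, 1, 1)); /* shuffle 2 */
  __m128i even  = _mm_mul_epu32(a, b);          /* pmuludq 1: products 0, 2 */
  __m128i odd   = _mm_mul_epu32(a_odd, b_odd);  /* pmuludq 2: products 1, 3 */
  /* Merge the low 32 bits of each 64-bit product: [p0,p2,p1,p3], reorder. */
  __m128 merged = _mm_shuffle_ps(_mm_castsi128_ps(even), _mm_castsi128_ps(odd),
                                 _MM_SHUFFLE(2, 0, 2, 0));       /* shuffle 3 */
  return _mm_shuffle_epi32(_mm_castps_si128(merged),
                           _MM_SHUFFLE(3, 1, 2, 0));             /* shuffle 4 */
}

On SSE4.1 and later this whole dance is replaced by a single pmulld, which is
why the special case in the patch is gated on !ST->hasSSE41().
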
> Modified: llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll?rev=176403&r1=176402&r2=176403&view=diff
> ==============================================================================
> --- llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll (original)
> +++ llvm/trunk/test/Transforms/LoopVectorize/X86/avx1.ll Fri Mar  1 22:02:52 2013
> @@ -27,7 +27,7 @@ define i32 @read_mod_write_single_ptr(fl
> 
> 
> ;CHECK: @read_mod_i64
> -;CHECK: load <4 x i64>
> +;CHECK: load <2 x i64>
> ;CHECK: ret i32
> define i32 @read_mod_i64(i64* nocapture %a, i32 %n) nounwind uwtable ssp {
>   %1 = icmp sgt i32 %n, 0
> @@ -37,7 +37,7 @@ define i32 @read_mod_i64(i64* nocapture
>   %indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]
>   %2 = getelementptr inbounds i64* %a, i64 %indvars.iv
>   %3 = load i64* %2, align 4
> -  %4 = mul i64 %3, 3
> +  %4 = add i64 %3, 3
>   store i64 %4, i64* %2, align 4
>   %indvars.iv.next = add i64 %indvars.iv, 1
>   %lftr.wideiv = trunc i64 %indvars.iv.next to i32
> 
> 