[PATCH] D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.

Wed Mar 31 06:58:35 PDT 2021

everton.constantino added a comment.

@fhahn When I mentioned the splats I was talking about the IR, not the final code. On the Godbolt links you sent, I see the same thing you do. However, take a look at the IR your example generates:

  %vec.cast = bitcast [4 x float]* %A to <2 x float>*
  %col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
  %vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
  %vec.cast2 = bitcast float* %vec.gep to <2 x float>*
  %col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
  %vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
  %col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
  %vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
  %vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
  %col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
  %splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
  %0 = fmul <2 x float> %col.load, %splat.splat
  %splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
  %splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
  %2 = fmul <2 x float> %col.load, %splat.splat14
  %splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
  %vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
  %col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
  %vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
  %vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
  %col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
  %4 = fadd <2 x float> %1, %col.load19
  %5 = fadd <2 x float> %3, %col.load22
  store <2 x float> %4, <2 x float>* %vec.cast18, align 4
  store <2 x float> %5, <2 x float>* %vec.cast21, align 4

I don't see a simple, reliable pattern to match the operands of %4 with %0, for example, and this is what I meant by the splat in the middle. The pragma approach assumes that we're always targeting architectures where the better approach is to fuse the fmuls and fadds. The real decision here is whether or not to preload the accumulator. On IBM Power10's MMA, for example, this would be pretty far from optimal, because it has specific instructions to load accumulators.
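
To make the two orderings concrete, here is a minimal scalar sketch in plain C. It is not code from the patch and does not use the matrix extension; the function names are made up for illustration, and the 2x2 column-major layout is assumed from the IR above. The first function mirrors the shape of that IR (finish the product, then add C at the end), while the second starts from C and folds each product into it, which is roughly what I mean by preloading the accumulator.

```c
/* 2x2 column-major matrices: element (row r, column c) is m[c * 2 + r].
 * Both functions compute C += A * B; only the order of operations differs.
 * Function names are hypothetical and exist only for this illustration. */

/* Shape of the quoted IR: build the product first, add C last. */
void multiply_then_add(const float A[4], const float B[4], float C[4]) {
    for (int j = 0; j < 2; ++j) {
        for (int r = 0; r < 2; ++r) {
            float acc = A[r] * B[j * 2];          /* like the fmul          */
            acc = A[2 + r] * B[j * 2 + 1] + acc;  /* like llvm.fmuladd      */
            C[j * 2 + r] = acc + C[j * 2 + r];    /* like the trailing fadd */
        }
    }
}

/* Accumulator-first: load C, then accumulate every product into it. */
void preload_accumulator(const float A[4], const float B[4], float C[4]) {
    for (int j = 0; j < 2; ++j) {
        for (int r = 0; r < 2; ++r) {
            float acc = C[j * 2 + r];             /* preload accumulator    */
            acc = A[r] * B[j * 2] + acc;          /* multiply-add           */
            acc = A[2 + r] * B[j * 2 + 1] + acc;  /* multiply-add           */
            C[j * 2 + r] = acc;
        }
    }
}
```

The two forms give the same result when no rounding is involved, but in general they round differently, so a backend cannot freely rewrite one into the other without the appropriate fast-math permissions. That is part of why I would rather make the choice explicit than try to recover it by pattern matching.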

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99433/new/

https://reviews.llvm.org/D99433