[PATCH] D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.
Everton Constantino via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Wed Mar 31 06:58:35 PDT 2021
everton.constantino added a comment.
@fhahn When I mentioned the splats I was talking about the IR, not the final code. On the Godbolt links you sent, I see the same thing you do. However, take a look at the IR your example generates:
%vec.cast = bitcast [4 x float]* %A to <2 x float>*
%col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
%vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
%vec.cast2 = bitcast float* %vec.gep to <2 x float>*
%col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
%vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
%col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
%vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
%vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
%col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
%splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
%0 = fmul <2 x float> %col.load, %splat.splat
%splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
%splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
%2 = fmul <2 x float> %col.load, %splat.splat14
%splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
%vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
%col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
%vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
%vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
%col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
%4 = fadd <2 x float> %1, %col.load19
%5 = fadd <2 x float> %3, %col.load22
store <2 x float> %4, <2 x float>* %vec.cast18, align 4
store <2 x float> %5, <2 x float>* %vec.cast21, align 4
I don't see a simple, reliable pattern for matching the operands of %4 back to %0, for example, and this is what I meant by the splat in the middle. The pragma approach assumes that we're always targeting architectures where fusing the fmuls and fadds is the better approach. The real decision here is whether or not to preload the accumulator. On IBM Power10's MMA, for example, this would be pretty far from optimal, because it has specific instructions for loading accumulators.
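To make the two orderings concrete, here is a minimal scalar sketch in C of what I mean, assuming 2x2 column-major float matrices as in the IR above. The function names multiply_then_add and accumulate_in_place are hypothetical and only for illustration; they are not part of the patch or of any LLVM API. The first function follows the shape of the IR (fmul, fmuladd, then a trailing fadd with C), while the second starts from the accumulator, which is roughly the form an accumulator-based target would want.

```c
/* 2x2 column-major layout: element (r, c) lives at index c * 2 + r. */

/* Illustrative only: mirrors the IR above. The product A*B is built first
   and the old value of C is added at the very end (the trailing fadd). */
static void multiply_then_add(const float A[4], const float B[4], float C[4]) {
    for (int j = 0; j < 2; ++j) {
        float col[2];
        for (int r = 0; r < 2; ++r) {
            col[r] = A[r] * B[j * 2];                  /* fmul with splat of B(0, j) */
            col[r] = A[2 + r] * B[j * 2 + 1] + col[r]; /* fmuladd with splat of B(1, j) */
        }
        for (int r = 0; r < 2; ++r)
            C[j * 2 + r] = col[r] + C[j * 2 + r];      /* fadd with the loaded C column */
    }
}

/* Illustrative only: the accumulator is loaded first and every
   multiply-add accumulates directly into it. */
static void accumulate_in_place(const float A[4], const float B[4], float C[4]) {
    for (int j = 0; j < 2; ++j) {
        for (int r = 0; r < 2; ++r) {
            float acc = C[j * 2 + r];                  /* preload the accumulator */
            acc = A[r] * B[j * 2] + acc;
            acc = A[2 + r] * B[j * 2 + 1] + acc;
            C[j * 2 + r] = acc;
        }
    }
}
```

Both functions compute C += A * B and agree on small integer-valued inputs; in general the floating-point results may differ slightly because the additions are ordered differently. Recovering the second form from the first is the pattern-matching problem described above.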
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D99433/new/
https://reviews.llvm.org/D99433