[PATCH] D70456: [Matrix] Add first set of matrix intrinsics and initial lowering pass.
Florian Hahn via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Apr 3 10:47:31 PDT 2020
fhahn added a comment.
In D70456#1959561 <https://reviews.llvm.org/D70456#1959561>, @LuoYuanke wrote:
> I have the similar question on how to lower matrix intrinsics to some HW specific intrinsics/instruction. For example, X86 have the AVX512_VNNI feature (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=39,5370,5361,364,142,139,2210&text=vnni). It can perform dot product computation. But after matrix intrinsic is lowered to vector, it seems difficult to transform the vector operation to AVX512_VNNI intrinsic/instruction.
For example, assume we have an imaginary `float @llvm.dot(<2 x float>, <2 x float>)` that computes the dot product of its two arguments, and we would like to lower `@llvm.matrix.multiply(<4 x float> %a, <4 x float> %b, 2, 2, 2)` using `@llvm.dot`. Currently, the LowerMatrixIntrinsics pass is where this needs to happen, similar to the tiling patch (D75566 <https://reviews.llvm.org/D75566>). You could add a separate `emitMultiplyUsingLLVMDot()` which would generate something like the following (the matrices are stored column-major in the flat vectors):
; Extract the rows of %a (stride 2) and the columns of %b (contiguous).
; Mask indices 0-3 select from the first shufflevector operand, so the
; matrix has to be the first operand and undef the second.
%a.row.0 = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 0, i32 2>
%a.row.1 = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 1, i32 3>
%b.col.0 = shufflevector <4 x float> %b, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%b.col.1 = shufflevector <4 x float> %b, <4 x float> undef, <2 x i32> <i32 2, i32 3>
; Compute each result element and insert it in column-major order.
%r.0.0 = call float @llvm.dot(<2 x float> %a.row.0, <2 x float> %b.col.0)
%res.1 = insertelement <4 x float> undef, float %r.0.0, i32 0
%r.1.0 = call float @llvm.dot(<2 x float> %a.row.1, <2 x float> %b.col.0)
%res.2 = insertelement <4 x float> %res.1, float %r.1.0, i32 1
%r.0.1 = call float @llvm.dot(<2 x float> %a.row.0, <2 x float> %b.col.1)
%res.3 = insertelement <4 x float> %res.2, float %r.0.1, i32 2
%r.1.1 = call float @llvm.dot(<2 x float> %a.row.1, <2 x float> %b.col.1)
%res.4 = insertelement <4 x float> %res.3, float %r.1.1, i32 3
We have successfully used something similar internally. If you are interested, I could share infrastructure to create code that applies smaller building blocks (like a fast 2x2 multiplication) to lower multiplies on larger matrices.
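For anyone who wants to check the index arithmetic of the IR above, here is a minimal scalar sketch in Python. It is only a reference model, not part of the patch: `dot` stands in for the imaginary `@llvm.dot`, `multiply_using_dot` is a hypothetical name, and the flat lists are assumed to hold the matrices in column-major order, as in the IR.

```python
def dot(x, y):
    """Stand-in for the imaginary @llvm.dot: sum of elementwise products."""
    return sum(a * b for a, b in zip(x, y))


def multiply_using_dot(a, b, rows, inner, cols):
    """Multiply flat column-major matrices a (rows x inner) and b (inner x cols).

    Returns the (rows x cols) product as a flat column-major list.
    """
    # Row i of a is strided: elements i, i + rows, i + 2 * rows, ...
    # (this mirrors the %a.row.N shuffles with masks <0, 2> and <1, 3>).
    a_rows = [[a[i + k * rows] for k in range(inner)] for i in range(rows)]
    # Column j of b is contiguous: elements j * inner .. j * inner + inner - 1
    # (this mirrors the %b.col.N shuffles with masks <0, 1> and <2, 3>).
    b_cols = [b[j * inner:(j + 1) * inner] for j in range(cols)]

    res = [0.0] * (rows * cols)
    for j in range(cols):
        for i in range(rows):
            # Element (i, j) lives at index i + j * rows in column-major
            # order, matching the insertelement indices 0, 1, 2, 3.
            res[i + j * rows] = dot(a_rows[i], b_cols[j])
    return res


# a = [[1, 3], [2, 4]] and b = [[5, 7], [6, 8]], both stored column-major.
print(multiply_using_dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 2, 2, 2))
# prints [23.0, 34.0, 31.0, 46.0]
```

With `rows = inner = cols = 2` the loop visits the result elements in the same order as the four `@llvm.dot` calls: (0,0), (1,0), (0,1), (1,1).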
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D70456/new/
https://reviews.llvm.org/D70456