[PATCH] D70456: [Matrix] Add first set of matrix intrinsics and initial lowering pass.

Fri Apr 3 10:47:31 PDT 2020

fhahn added a comment.

In D70456#1959561 <https://reviews.llvm.org/D70456#1959561>, @LuoYuanke wrote:

> I have the similar question on how to lower matrix intrinsics to some HW specific intrinsics/instruction. For example, X86 have the AVX512_VNNI feature (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=39,5370,5361,364,142,139,2210&text=vnni). It can perform dot product computation. But after matrix intrinsic is lowered to vector, it seems difficult to transform the vector operation to AVX512_VNNI intrinsic/instruction.

For example, assume we have an imaginary `float @llvm.dot(<2 x float>, <2 x float>)` that computes the dot product of its two arguments, and we would like to lower `@llvm.matrix.multiply(<4 x float> %a, <4 x float> %b, 2, 2, 2)` using `@llvm.dot`. Currently, the LowerMatrixIntrinsics pass is where this needs to happen, similar to the tiling patch (D75566 <https://reviews.llvm.org/D75566>). You could add a separate `emitMultiplyUsingLLVMDot()`, which would generate something like

  %a.row.0 = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 0, i32 2>
  %a.row.1 = shufflevector <4 x float> %a, <4 x float> undef, <2 x i32> <i32 1, i32 3>
  %b.col.0 = shufflevector <4 x float> %b, <4 x float> undef, <2 x i32> <i32 0, i32 1>
  %b.col.1 = shufflevector <4 x float> %b, <4 x float> undef, <2 x i32> <i32 2, i32 3>

  %r.0.0 = call float @llvm.dot(<2 x float> %a.row.0, <2 x float> %b.col.0)
  %res.1 = insertelement <4 x float> undef, float %r.0.0, i32 0
  %r.1.0 = call float @llvm.dot(<2 x float> %a.row.1, <2 x float> %b.col.0)
  %res.2 = insertelement <4 x float> %res.1, float %r.1.0, i32 1
  %r.0.1 = call float @llvm.dot(<2 x float> %a.row.0, <2 x float> %b.col.1)
  %res.3 = insertelement <4 x float> %res.2, float %r.0.1, i32 2
  %r.1.1 = call float @llvm.dot(<2 x float> %a.row.1, <2 x float> %b.col.1)
  %res.4 = insertelement <4 x float> %res.3, float %r.1.1, i32 3

We have successfully used something similar internally. If you are interested, I could share infrastructure to create code that applies smaller building blocks (like a fast 2x2 multiplication) to lower multiplies on larger matrices.
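
For reference, the arithmetic performed by the IR above can be modeled in plain C++. This is only a sketch under stated assumptions, not the pass implementation: `dot` stands in for the imaginary `@llvm.dot`, `multiplyUsingDot` is a hypothetical name, and the operands are assumed to be flattened in column-major order, as the shuffle masks above imply.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the imaginary @llvm.dot: dot product of two equal-length vectors.
float dot(const std::vector<float> &x, const std::vector<float> &y) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i)
    sum += x[i] * y[i];
  return sum;
}

// Hypothetical reference model: multiplies an R x K matrix `a` by a K x C
// matrix `b`, both flattened in column-major order, using one dot() call
// per result element.
std::vector<float> multiplyUsingDot(const std::vector<float> &a,
                                    const std::vector<float> &b,
                                    std::size_t R, std::size_t K,
                                    std::size_t C) {
  std::vector<float> res(R * C);
  for (std::size_t c = 0; c < C; ++c) {
    // Column c of b is contiguous: elements c*K .. c*K + K - 1.
    std::vector<float> bCol(K);
    for (std::size_t k = 0; k < K; ++k)
      bCol[k] = b[c * K + k];
    for (std::size_t r = 0; r < R; ++r) {
      // Row r of a is strided: elements r, r + R, r + 2R, ...
      std::vector<float> aRow(K);
      for (std::size_t k = 0; k < K; ++k)
        aRow[k] = a[r + k * R];
      // Result element (r, c) lands at index r + c*R (column-major).
      res[r + c * R] = dot(aRow, bCol);
    }
  }
  return res;
}
```

With R = K = C = 2, the loop visits the result elements in the same order as the `insertelement` chain above: (0,0), (1,0), (0,1), (1,1).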

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70456/new/

https://reviews.llvm.org/D70456