[cfe-dev] [llvm-dev] RFC: First-class Matrix type

Thu Oct 11 00:58:29 PDT 2018

This sounds like it would be really useful for 3D graphics APIs, since SPIR-V
(the Vulkan/OpenCL 2.1 intermediate representation) has matrices as
first-class types. A matching type in LLVM IR would make round-tripping
between the two easier. I have also heard that RISC-V's V extension wants to
support matrices (not sure if it's as an additional extension on top of V or
as part of V proper).

For the ABI, I recommend considering compatibility with the layouts
required for 3D Graphics APIs. For Vulkan, I think the relevant section in
the spec is 14.5.4.

Jacob Lifshay

On Wed, Oct 10, 2018 at 11:10 PM Adam Nemet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi,
>
> We are proposing first-class type support for a new matrix type.  This is
> a natural extension of the current vector type with an extra dimension.
> For example, this is what the IR for a matrix multiply would look like for
> a 4x4 matrix with element type float:
>
> %0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
> %1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
> %2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
> store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
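For readers who want the arithmetic spelled out, here is a minimal scalar sketch of what a 4x4 float multiply computes. It assumes the column-major, unpadded layout that the RFC names as its current default; the `Mat4` alias and `multiply4x4` are hypothetical names for illustration and are not part of the proposal.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper type: a 4x4 float matrix stored column-major with no
// padding, so element (row, col) lives at index col * 4 + row.
using Mat4 = std::array<float, 16>;

// Scalar reference for C = A * B on column-major 4x4 matrices.
Mat4 multiply4x4(const Mat4& a, const Mat4& b) {
    constexpr std::size_t N = 4;
    Mat4 c{};  // zero-initialized accumulator
    for (std::size_t col = 0; col < N; ++col)
        for (std::size_t row = 0; row < N; ++row)
            for (std::size_t k = 0; k < N; ++k)
                // C(row, col) += A(row, k) * B(k, col)
                c[col * N + row] += a[k * N + row] * b[col * N + k];
    return c;
}
```

This is only the reference semantics; a real lowering would presumably operate on whole columns as vectors rather than on scalars.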
>
>
> Currently we support element-wise binary operations, matrix multiply,
> matrix-scalar multiply, matrix transpose, and extract/insert of an element.
> Besides the regular full-matrix load and store, we also support loading and
> storing a matrix as a submatrix of a larger matrix in memory.  We are also
> planning to implement vector-extract/insert and matrix-vector multiply.
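Loading a matrix as a submatrix of a larger matrix amounts to a strided copy. The sketch below assumes column-major storage and takes the enclosing matrix's column stride as a parameter; `loadSubmatrix` and its signature are hypothetical and do not reflect the actual intrinsic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a rows x cols submatrix out of a larger
// column-major matrix. `base` points at the submatrix's first element and
// `stride` is the number of elements between the starts of consecutive
// columns in the enclosing matrix (its row count when there is no padding).
std::vector<float> loadSubmatrix(const float* base, std::size_t stride,
                                 std::size_t rows, std::size_t cols) {
    std::vector<float> out(rows * cols);
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            out[c * rows + r] = base[c * stride + r];
    return out;
}
```

A submatrix store would be the same loop with the assignment reversed.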
>
> All of these are currently implemented as intrinsics.  Where applicable we
> also plan to support these operations with native IR instructions (e.g.
> add/fadd).
>
> These are exposed in clang via builtins.  E.g. the above operation looks
> like this in C/C++:
>
> typedef float mf4x4_t __attribute__((matrix_type(4, 4)));
>
> mf4x4_t multiply(mf4x4_t a, mf4x4_t b) {
>   return __builtin_matrix_multiply(a, b);
> }
>
>
> ** Benefits **
>
> Having matrices represented as IR values allows for the usual algebraic
> and redundancy optimizations.  But most importantly, by lifting memory
> aliasing concerns, we can guarantee vectorization to target-specific
> vectors.  Having a matrix-multiply intrinsic also allows using FMA
> regardless of the optimization level, which is the usual sticking point with
> adopting FP-contraction.
>
> Adding a new dedicated first-class type has several advantages over
> mapping them directly to existing IR types like vectors in the front end.
> Matrices have the unique requirement that both rows and columns need to be
> accessed efficiently.  By retaining the original type, we can analyze
> entire chains of operations (after inlining) and determine the most
> efficient *intermediate layout* for the matrix values between ABI
> observable points (calls, memory operations).
>
> The resulting intermediate layout could be something like a single vector
> spanning the entire matrix or a set of vectors and scalars representing
> individual rows/columns.  This is beneficial for example because
> rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3
> matrix).
>
> The layout could also be made up of tiles/submatrices of the matrix.  This
> is an important case for us to fight register pressure.  Rather than
> loading entire matrices into registers it lets us load only parts of the
> input matrix at a time in order to compute some part of the output matrix.
> Since this transformation reorders memory operations, we may also need to
> emit run-time alias checks.
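To illustrate the tiling idea, here is a rough sketch of a multiply that only touches one tile of each operand at a time. It assumes square column-major matrices whose size is a multiple of the tile size, and that `c` does not alias `a` or `b`; `multiplyTiled` is a hypothetical name, and the actual pass may tile quite differently.

```cpp
#include <cstddef>
#include <vector>

// C += A * B for n x n column-major matrices, processed in tile x tile blocks.
// Simplifying assumptions for this sketch: n is a multiple of tile, and c
// does not alias a or b.
void multiplyTiled(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, std::size_t n, std::size_t tile) {
    for (std::size_t j0 = 0; j0 < n; j0 += tile)          // tile column of C
        for (std::size_t i0 = 0; i0 < n; i0 += tile)      // tile row of C
            for (std::size_t k0 = 0; k0 < n; k0 += tile)  // matching tiles of A and B
                for (std::size_t j = j0; j < j0 + tile; ++j)
                    for (std::size_t k = k0; k < k0 + tile; ++k)
                        for (std::size_t i = i0; i < i0 + tile; ++i)
                            // C(i, j) += A(i, k) * B(k, j)
                            c[j * n + i] += a[k * n + i] * b[j * n + k];
}
```

The no-alias assumption is what the run-time alias checks mentioned above would have to establish once loads and stores of different tiles are interleaved.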
>
> Having a dedicated first-class type also allows for dedicated
> target-specific *ABIs* for matrices.  This is a pretty rich area.  It
> includes whether the matrix is stored in row-major or column-major order,
> whether there is padding between rows/columns, and when and how matrices
> are passed in argument registers.  Providing flexibility on the ABI side
> was critical for the adoption of the new type at Apple.
>
> Having all this knowledge at the IR level means that *front-ends* are
> able to share the complexities of the implementation.  They just map their
> matrix type to the IR type and the builtins to intrinsics.
>
> At Apple, we also need to support *interoperability* between row-major
> and column-major layout.  Since conversion between the two layouts is
> costly, they should be separate types requiring explicit instructions to
> convert between them.  Extending the new type to include the order makes
> tracking the format easy and allows finding optimal conversion points.
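The cost of converting between the two layouts is easy to see in scalar form: every element moves to a different index. A minimal sketch, assuming unpadded storage on both sides (`rowMajorToColumnMajor` is a hypothetical helper, not a proposed intrinsic):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: convert an unpadded rows x cols matrix from row-major
// to column-major storage. Element (r, c) moves from r * cols + c to
// c * rows + r.
std::vector<float> rowMajorToColumnMajor(const std::vector<float>& src,
                                         std::size_t rows, std::size_t cols) {
    std::vector<float> dst(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}
```

For example, a 2x3 matrix stored row-major as {1, 2, 3, 4, 5, 6} becomes {1, 4, 2, 5, 3, 6} in column-major order.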
>
> ** ABI **
>
> We currently default to column-major order with no padding between the
> columns in memory.  We have plans to also support row-major order and we
> would probably have to support padding at some point for targets where
> unaligned accesses are slow.  In order to make the IR self-contained I am
> planning to make the defaults explicit in the DataLayout string.
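As a small illustration of what padding changes, the sketch below computes element indices for a column-major matrix with and without padded columns. The helper names and the "pad each column to a multiple of N elements" rule are assumptions made for this example; the RFC does not specify a padding scheme.

```cpp
#include <cstddef>

// Assumed padding rule for this sketch: each column is rounded up to a
// multiple of padTo elements (padTo == 1 means no padding).
constexpr std::size_t columnStride(std::size_t rows, std::size_t padTo) {
    return (rows + padTo - 1) / padTo * padTo;
}

// Index of element (row, col) in column-major storage with the given stride.
constexpr std::size_t elementIndex(std::size_t row, std::size_t col,
                                   std::size_t stride) {
    return col * stride + row;
}

// A 3x3 matrix: unpadded columns are 3 elements apart, while columns padded
// to 4 elements each start on a 4-element boundary.
static_assert(columnStride(3, 1) == 3, "no padding");
static_assert(columnStride(3, 4) == 4, "padded to 4 elements");
```

With float elements, the padded 3x3 case puts every column on a 16-byte boundary at the cost of one unused element per column.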
>
> For function parameters and return values, matrices are currently placed
> in memory.  Moving forward, we should pass small matrices in vector
> registers.  Treating matrices as structures of vectors seems a natural
> default.  This works well for AArch64, since Homogeneous Short-Vector
> Aggregates (HVA) can use all 8 SIMD argument registers.  Thus we could pass
> for example two 4 x 4 x float matrices in registers.  However on X86, we
> can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float
> matrices.
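The figures in the paragraph above can be checked with some byte counting. This sketch only reproduces the arithmetic as stated, assuming 128-bit vector registers, 4-byte floats, and one or more registers per column; it does not model either ABI's actual classification rules, and all names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kVectorRegBytes = 16;             // assumed 128-bit SIMD registers
constexpr std::size_t kFloatBytes = 4;                  // assumed 32-bit float
constexpr std::size_t kAArch64SimdArgRegs = 8;          // figure quoted in the text
constexpr std::size_t kX86AggregateLimitBytes = 4 * 8;  // "four eightbytes"

// Total storage of an unpadded rows x cols matrix.
constexpr std::size_t matrixBytes(std::size_t rows, std::size_t cols,
                                  std::size_t elemBytes) {
    return rows * cols * elemBytes;
}

// Registers needed when each column is passed as its own vector(s).
constexpr std::size_t vectorRegsNeeded(std::size_t rows, std::size_t cols,
                                       std::size_t elemBytes) {
    return cols * ((rows * elemBytes + kVectorRegBytes - 1) / kVectorRegBytes);
}

// Two 4x4 float matrices fill all 8 SIMD argument registers.
static_assert(2 * vectorRegsNeeded(4, 4, kFloatBytes) == kAArch64SimdArgRegs,
              "two 4x4 float matrices need 8 vector registers");
// Four eightbytes hold exactly two 2x2 float matrices.
static_assert(kX86AggregateLimitBytes / matrixBytes(2, 2, kFloatBytes) == 2,
              "32 bytes fit two 16-byte matrices");
```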
>
> Alternatively, we could treat a matrix as if its rows/columns were passed
> as separate vector arguments.  This would allow using all 8 vector argument
> registers on X86 too.
>
> Alignment of the matrix type is the same as the alignment of its first
> row/column vector.
>
> ** Flow **
>
> Clang at this point mostly just forwards everything to LLVM.  Then in
> LLVM, we have an IR function pass that lowers the matrices to
> target-supported vectors.  As with vectors, matrices can be of any static
> size with any of the primitive types as the element type.
>
> After the lowering pass, we only have matrix function arguments and
> instructions building up and splitting matrix values from and to vectors.
> CodeGen then lowers the arguments and forwards the vector values.  CodeGen
> is already capable of further lowering vectors of any size to scalars if
> the target does not support vectors.
>
> The lowering pass is also run at -O0 rather than legalizing the matrix
> type during CodeGen like it’s done for structure values or illegal
> vectors.  I don’t see much value in duplicating this logic across
> the IR and CodeGen.  We just need a lighter mode in the pass at -O0.
>
> ** Roll-out and Maintenance **
>
> Since this will be experimental for some time, I am planning to put this
> behind a flag: -fenable-experimental-matrix-type.  ABI and intrinsic
> compatibility won’t be guaranteed initially until we lift the experimental
> status.
>
> We are obviously interested in maintaining and improving this code in the
> future.
>
> Looking forward to comments and suggestions.
>
> Thanks,
> Adam