[cfe-dev] [llvm-dev] RFC: First-class Matrix type
Jacob Lifshay via cfe-dev
cfe-dev at lists.llvm.org
Thu Oct 11 00:58:29 PDT 2018
This sounds like it would be really useful for 3D Graphics APIs as SPIR-V
(the Vulkan/OpenCL 2.1 intermediate representation) has matrices as
first-class types. This will promote round-trip-ability. I have also heard
that RISC-V's V extension wants to support matrices (not sure if it's as an
additional extension on top of V or as part of V proper).
For the ABI, I recommend considering compatibility with the layouts
required for 3D Graphics APIs. For Vulkan, I think the relevant section in
the spec is 14.5.4.
Jacob Lifshay
On Wed, Oct 10, 2018 at 11:10 PM Adam Nemet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi,
>
> We are proposing first-class type support for a new matrix type. This is
> a natural extension of the current vector type with an extra dimension.
> For example, this is what the IR for a matrix multiply would look like for
> a 4x4 matrix with element type float:
>
> %0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
> %1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
> %2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
> store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
>
>
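To spell out what the multiply above computes, here is a minimal scalar sketch in C++, assuming the column-major, unpadded layout described under ABI in the proposal. `Mat4`, `idx` and `multiply4x4` are made-up names for illustration only; they are not part of the proposal or of any LLVM/clang API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for <4 x 4 x float>: 16 floats in column-major order.
using Mat4 = std::array<float, 16>;

// Index of element (row, col) in a column-major 4x4 matrix without padding.
inline std::size_t idx(std::size_t row, std::size_t col) {
    assert(row < 4 && col < 4);
    return col * 4 + row;
}

// Reference semantics of a 4x4 by 4x4 product: c(i,j) = sum over k of a(i,k) * b(k,j).
Mat4 multiply4x4(const Mat4& a, const Mat4& b) {
    Mat4 c{};  // value-initialized, so every element starts at zero
    for (std::size_t j = 0; j < 4; ++j)
        for (std::size_t k = 0; k < 4; ++k)
            for (std::size_t i = 0; i < 4; ++i)
                c[idx(i, j)] += a[idx(i, k)] * b[idx(k, j)];
    return c;
}
```

The innermost loop walks down one column of `a` and `c`, which is the access pattern a lowering to column vectors would presumably exploit.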
> Currently we support element-wise binary operations, matrix multiply,
> matrix-scalar multiply, matrix transpose, extract/insert of an element.
> Besides the regular full-matrix load and store, we also support loading and
> storing a matrix as a submatrix of a larger matrix in memory. We are also
> planning to implement vector-extract/insert and matrix-vector multiply.
>
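For the submatrix load, my reading is that it behaves like a strided copy out of the larger matrix. A rough C++ sketch under that assumption (column-major storage; `loadSubmatrix` and its parameters are hypothetical names, not the actual intrinsic signature):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a rows x cols block whose top-left element is (row0, col0) out of a
// larger column-major matrix. `stride` is the number of elements between the
// starts of two consecutive columns of the larger matrix.
std::vector<float> loadSubmatrix(const float* base, std::size_t stride,
                                 std::size_t row0, std::size_t col0,
                                 std::size_t rows, std::size_t cols) {
    assert(row0 + rows <= stride && "block must fit inside one column");
    std::vector<float> out(rows * cols);  // result is itself column-major
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            out[c * rows + r] = base[(col0 + c) * stride + (row0 + r)];
    return out;
}
```

A submatrix store would be the same loop with source and destination swapped.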
> All of these are currently implemented as intrinsics. Where applicable we
> also plan to support these operations with native IR instructions (e.g.
> add/fadd).
>
> These are exposed in clang via builtins. E.g. the above operation looks
> like this in C/C++:
>
> typedef float mf4x4_t __attribute__((matrix_type(4, 4)));
>
> mf4x4_t multiply(mf4x4_t a, mf4x4_t b) {
>   return __builtin_matrix_multiply(a, b);
> }
>
>
> ** Benefits **
>
> Having matrices represented as IR values allows for the usual algebraic
> and redundancy optimizations. But most importantly, by lifting memory
> aliasing concerns, we can guarantee vectorization to target-specific
> vectors. Having a matrix-multiply intrinsic also allows using FMA
> regardless of the optimization level which is the usual sticking point with
> adopting FP-contraction.
>
> Adding a new dedicated first-class type has several advantages over
> mapping them directly to existing IR types like vectors in the front end.
> Matrices have the unique requirement that both rows and columns need to be
> accessed efficiently. By retaining the original type, we can analyze
> entire chains of operations (after inlining) and determine the most
> efficient *intermediate layout* for the matrix values between ABI
> observable points (calls, memory operations).
>
> The resulting intermediate layout could be something like a single vector
> spanning the entire matrix or a set of vectors and scalars representing
> individual rows/columns. This is beneficial for example because
> rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3
> matrix).
>
> The layout could also be made up of tiles/submatrices of the matrix. This
> is an important case for us to fight register pressure. Rather than
> loading entire matrices into registers it lets us load only parts of the
> input matrix at a time in order to compute some part of the output matrix.
> Since this transformation reorders memory operations, we may also need to
> emit run-time alias checks.
>
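The tiling idea can be illustrated with a plain blocked multiply. This is only a sketch of the general technique in C++, assuming square column-major matrices whose size is a multiple of the tile size; `tiledMultiply` is a made-up name and this is not the lowering the pass actually performs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C += A * B for n x n column-major matrices, computed one tile x tile block
// at a time so that only a few small blocks need to be live at once.
// `c` must not alias `a` or `b`, because the tiling reorders loads and stores.
void tiledMultiply(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, std::size_t n, std::size_t tile) {
    assert(tile > 0 && n % tile == 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t j0 = 0; j0 < n; j0 += tile)
        for (std::size_t i0 = 0; i0 < n; i0 += tile)
            for (std::size_t k0 = 0; k0 < n; k0 += tile)
                // Accumulate one tile of C from one tile of A and one tile of B.
                for (std::size_t j = j0; j < j0 + tile; ++j)
                    for (std::size_t k = k0; k < k0 + tile; ++k)
                        for (std::size_t i = i0; i < i0 + tile; ++i)
                            c[j * n + i] += a[k * n + i] * b[j * n + k];
}
```

The no-alias precondition is the source-level counterpart of the run-time alias checks mentioned above: if the output overlapped an input, the reordered stores could change values that are read later.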
> Having a dedicated first-class type also allows for dedicated
> target-specific *ABIs* for matrices. This is a pretty rich area: it
> includes whether the matrix is stored in row-major or column-major order,
> whether there is padding between rows/columns, and when and how matrices
> are passed in argument registers. Providing flexibility on the ABI side
> was critical for the adoption of the new type at Apple.
>
> Having all this knowledge at the IR level means that *front-ends* are
> able to share the complexities of the implementation. They just map their
> matrix type to the IR type and the builtins to intrinsics.
>
> At Apple, we also need to support *interoperability* between row-major
> and column-major layout. Since conversion between the two layouts is
> costly, they should be separate types requiring explicit instructions to
> convert between them. Extending the new type to include the order makes
> tracking the format easy and allows finding optimal conversion points.
>
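As a source-level analogy for keeping the two orders as separate types, here is a small C++ sketch in which the layout is part of the type and the only way across is an explicit conversion. `ColMajor`, `RowMajor` and `toRowMajor` are hypothetical names; the proposal does not specify this interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// R x C matrix stored column by column: element (r, c) is at index c * R + r.
template <std::size_t R, std::size_t C>
struct ColMajor {
    std::array<float, R * C> data{};
    float& at(std::size_t r, std::size_t c) {
        assert(r < R && c < C);
        return data[c * R + r];
    }
    float at(std::size_t r, std::size_t c) const {
        assert(r < R && c < C);
        return data[c * R + r];
    }
};

// R x C matrix stored row by row: element (r, c) is at index r * C + c.
template <std::size_t R, std::size_t C>
struct RowMajor {
    std::array<float, R * C> data{};
    float& at(std::size_t r, std::size_t c) {
        assert(r < R && c < C);
        return data[r * C + c];
    }
    float at(std::size_t r, std::size_t c) const {
        assert(r < R && c < C);
        return data[r * C + c];
    }
};

// The explicit (and costly) conversion point: every element is moved.
template <std::size_t R, std::size_t C>
RowMajor<R, C> toRowMajor(const ColMajor<R, C>& m) {
    RowMajor<R, C> out;
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            out.at(r, c) = m.at(r, c);
    return out;
}
```

Because there is no implicit conversion between the two structs, every layout change shows up as a visible call, which is the property that makes conversion points easy to track and move.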
> ** ABI **
>
> We currently default to column-major order with no padding between the
> columns in memory. We have plans to also support row-major order and we
> would probably have to support padding at some point for targets where
> unaligned accesses are slow. In order to make the IR self-contained I am
> planning to make the defaults explicit in the DataLayout string.
>
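To make the layout arithmetic concrete, here is a short C++ sketch of how an element offset could be computed for column-major storage with and without column padding. `columnStride` and `elementOffset` are names I made up, and rounding each column up to a multiple of some element count is only one possible padding rule, not something the proposal fixes.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements between the starts of two consecutive columns.
// padToElems == 1 means no padding (the current default); padToElems == 4
// rounds each column up to a multiple of four elements.
inline std::size_t columnStride(std::size_t rows, std::size_t padToElems) {
    assert(padToElems > 0);
    return (rows + padToElems - 1) / padToElems * padToElems;
}

// Offset, in elements, of (row, col) in a column-major matrix.
inline std::size_t elementOffset(std::size_t row, std::size_t col,
                                 std::size_t stride) {
    assert(row < stride);
    return col * stride + row;
}
```

For a 3x3 float matrix, `columnStride(3, 1)` gives 3 (nine contiguous floats), while `columnStride(3, 4)` gives 4, so each column starts on a 16-byte boundary at the cost of one unused float per column.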
> For function parameters and return values, matrices are currently placed
> in memory. Moving forward, we should pass small matrices in vector
> registers. Treating matrices as structures of vectors seems a natural
> default. This works well for AArch64, since Homogeneous Short-Vector
> Aggregates (HVA) can use all 8 SIMD argument registers. Thus we could pass,
> for example, two 4 x 4 x float matrices in registers. However, on X86 we
> can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float
> matrices.
>
> Alternatively, we could treat a matrix as if its rows/columns were passed
> as separate vector arguments. This would allow using all 8 vector argument
> registers on X86 too.
>
> Alignment of the matrix type is the same as the alignment of its first
> row/column vector.
>
> ** Flow **
>
> Clang at this point mostly just forwards everything to LLVM. Then in
> LLVM, we have an IR function pass that lowers the matrices to
> target-supported vectors. As with vectors, matrices can be of any static
> size with any of the primitive types as the element type.
>
> After the lowering pass, we only have matrix function arguments and
> instructions building up and splitting matrix values from and to vectors.
> CodeGen then lowers the arguments and forwards the vector values. CodeGen
> is already capable of further lowering vectors of any size to scalars if
> the target does not support vectors.
>
> The lowering pass is also run at -O0 rather than legalizing the matrix
> type during CodeGen as is done for structure values or invalid
> vectors. I don’t see much value in duplicating this logic across
> the IR and CodeGen. We just need a lighter mode in the pass at -O0.
>
> ** Roll-out and Maintenance **
>
> Since this will be experimental for some time, I am planning to put this
> behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic
> compatibility won’t be guaranteed initially until we lift the experimental
> status.
>
> We are obviously interested in maintaining and improving this code in the
> future.
>
> Looking forward to comments and suggestions.
>
> Thanks,
> Adam