[llvm-dev] RFC: Matrix math support

Mon Oct 28 09:07:55 PDT 2019

Hello,

This is a follow-up to Adam’s original RFC (RFC: First-class Matrix type, [1] <http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html>) <https://confluence.sd.apple.com/pages/createpage.action?spaceKey=DPT&title=1&linkCreation=true&fromPageId=1360610956> and his follow-up ([RFC] Matrix support (take 2, [2 <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html>]). The main sticking points of the last thread were centred around representing padding, propagating shape information (dimensions) and the number of proposed intrinsics [8] <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>.

We've incorporated the feedback into our design and would like to share our updated proposal. We stripped it down to the minimum required to realize the major benefits, like elimination of loads/stores to temporaries and generation of tuned loops for matrix multiplies. We discuss further extensions and points that came up during the earlier discussions after the new proposal.

TLDR: Our new proposal boils down to adding 4 matrix related intrinsics (transpose, multiply, and strided load & store) with shape information and an IR pass, that lowers the intrinsics to regular vector operations. The IR pass can also be extended to break down operations into chunks suitable for specialised matrix ISAs (like the matrix instructions included in ARM v8.6) and to reducing spilling. We've designed the approach so it's flexible and easy to extend.
Proposal

After the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3] <http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html>. 

With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support row-major layouts.

Initially, we would like to propose the following intrinsics:
<C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32 <N>)
Treat %in as containing a matrix with M rows and N columns and transpose it. 

<C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32 <M>, i32 <N>, i32<K>)
Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and N columns and multiply them. 

<C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Load a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient loading of sub matrixes. 

void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Store a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient storing of sub matrixes.

The floating point versions of the intrinsics also take fast-math flags, which can be used to opt-in to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.

The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets. Additionally, targets can implement their own lowering to a specialized ISA extension. We implemented such a lowering for an internal matrix accelerator. 

The example below shows how the matrix_transpose builtin will be lowered in the IR lowering pass.

define void @transpose(<8 x double> * %A, <8 x double>* %B) { 
entry:
%a = load <8 x double>, <8 x double>* %A, align 16
%b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)
store <8 x double> %c, <8 x double>* %B, align 16
ret void
}
declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)

Will get lowered to something like 

define void @transpose(<8 x double> * %A, <8 x double>* %B) {
entry:
%0 = bitcast <8 x double>* %A to <2 x double>*
%1 = load <2 x double>, <2 x double>* %0, align 8
%2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2
%3 = bitcast double* %2 to <2 x double>*
%4 = load <2 x double>, <2 x double>* %3, align 8
%5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4
%6 = bitcast double* %5 to <2 x double>*
%7 = load <2 x double>, <2 x double>* %6, align 8
%8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6
%10 = bitcast double* %9 to <2 x double>*
%11 = load <2 x double>, <2 x double>* %10, align 8
%12 = shufflevector <2 x double> %11, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
%14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>
%15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
%16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
%17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>
%18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 5>
%19 = bitcast <8 x double>* %C to <4 x double>*
store <4 x double> %15, <4 x double>* %19, align 8
%20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64 4
%21 = bitcast double* %20 to <4 x double>*
store <4 x double> %18, <4 x double>* %21, align 8
ret void
}

Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>* %A, align 16` and lower it to a series of `load <2 x double>, <2 x double>*`, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.

This infrastructure can also be used to improve codegen for large vector operations in general. For example, consider loading 2 <100 x double> vectors, adding them and storing the result. Currently this will be lowered naively doing most of the loads first, then most of the adds and then most of the stores, causing plenty of spilling on AArch64. We can significantly improve the generated code by breaking it into load/add/store blocks that fit into the target registers.
We also plan to implement some additional optimisations as part of the lowering, like generating fused/tiled loops for matrix multiplications and direct FMA generation.

As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.

In terms of integration, we are planning on adding a separate MatrixIRBuilder utility (as suggested in [4] <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>), which can be used by frontends to create various matrix operations. For point wise operations, they will lower to vector instructions directly. 
Potential Future Extensions

During the previous discussions, a few extensions were discussed, but we would like exclude them from the initial implementation. Those are
N-Dimension
Given that the dimensions are represented as arguments to intrinsics, we can go from supporting 2 dimensions to N dimensions by using variadic arguments for shape information and handle it accordingly while lowering.
Padding
For small matrixes, it might be desirable to automatically add additional padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3 matrix, to allow for using vector instructions operating on power-of-2 number of elements or satisfy an alignment requirement by a target. This allows for additional optimizations, but is not required for lowering the intrinsics. We also haven't seen this being an issue so far. We should be able to iterate on that, once it becomes an issue. (Earlier discussion [5] <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html> )
Support in Clang

We are also planning to expose the matrix support in Clang via a matrix type on the C/C++ level (similar to the ext_vector_type attribute) together with a set of matrix builtins. Similar to the LLVM part, we would like to propose a pragmatic solution for common matrix operations, while being flexible enough to allow iterative improvements to accommodate additional requirements. The main motivation for the matrix support on the Clang side is to give users a way to
Guarantee generating high-quality code for matrix operations and trees of matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
Move optimisations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigen-style matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.

We propose adding a new matrix value type, that can be declared via a `matrix_type` attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6] <https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>), if that is preferred. Initially, the matrix type supports 2 dimensions (rows & columns). For example, a 4x4 float matrix would be declared as `typedef float m4x4_t __attribute__((matrix_type(4, 4)));`. 

Initially we want to focus on 2D matrixes without padding in column-major layout as a concrete use case. But our proposed type can be extended naturally to
Support N (known constant) dimensions by turning matrix_type attribute into a variadic attribute.
Support column/row-wise padding, by adding a column_padding clause to the attribute.
Dealing with the padding could be exclusively handled on the frontend side, by emitting additional shufflevector instructions to extract the data. If there is a desire to exploit the padding more on the LLVM side, we can add a set of intrinsics for that.
Support row & column major layouts, by adding a layout clause to the attribute.
Again, this naively could be handled while lowering to LLVM IR in Clang using shufflevector to produce flattened vectors with the required layout. For better optimisations, the LLVM intrinsics relying on shape/layout information can be extended to take the layout as additional argument. Through propagating the layout information similar to the dimensions, we should be able to optimise the points where we need to transform the layout of the underlying matrixes.
In all cases, we require known constants as dimensions and we do not plan to support dynamic dimensions for now.

In addition to the type, we propose adding a set of overloaded builtins to perform operations on matrix values. To start with, we would like to add the following builtins: __builtin_matrix_add, __builtin_matrix_subtract, __builtin_matrix_multiply, __builtin_matrix_scalar_multiply, __builtin_matrix_transpose, __builtin_matrix_insert, __builtin_matrix_extract, __builtin_matrix_column_load, __builtin_matrix_column_store. Those builtins allowed us to cover a wide range of uses cases internally, but as mentioned on the LLVM side already, we are also evaluating adding additional operations, like clamping matrixes or getting the minimum/maximum of a matrix. We should be able to iterate on the set of operations, as required by actual users.

In terms of LLVM integration, we plan to provide a MatrixIRBuilder (see LLVM part), that can lower the various builtins to LLVM IR (in terms of matrix builtins or vector operations, depending on the builtin).

We currently have an internal implementation based on Adam’s original proposal ('first class matrix type') and are using that for a large internal project in production. We also prototyped the approach outlined in this proposal ('flattened matrixes') to make sure we can produce equivalent code (and thus performance) with both approaches on a large codebase.

We think predication is out of scope for the initial proposal and the right forum to discuss predication support in general are the related discussions on the list, the LLVM-VP proposal, [7] <https://reviews.llvm.org/D57504> and the round table at the Dev Meeting.

We think our current proposal addresses the concerns raised previously, especially the concerns around the high cost of adding a new IR type, adding too many new intrinsics and generalising the approach to N dimensions. Unless there are any additional major concerns, we should be able to share patches for review soon.

Cheers,
Florian

[1] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html <http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html>
[2] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html>
[3] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html <http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html>
[4] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>
[5] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html>
[6] https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors <https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>
[7] https://reviews.llvm.org/D57504 <https://reviews.llvm.org/D57504>
[8] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191028/9c39b45b/attachment-0001.html>