[llvm-dev] [cfe-dev] RFC: Matrix math support

Tue Nov 19 13:24:00 PST 2019

On Mon, 28 Oct 2019 at 09:08, Florian Hahn via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> Hello,This is a follow-up to Adam’s original RFC (RFC: First-class Matrix
> type, [1]
> <http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html>)
> <https://confluence.sd.apple.com/pages/createpage.action?spaceKey=DPT&title=1&linkCreation=true&fromPageId=1360610956> and
> his follow-up ([RFC] Matrix support (take 2, [2
> <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html>]).
> The main sticking points of the last thread were centred around
> representing padding, propagating shape information (dimensions) and the
> number of proposed intrinsics [8]
> <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>.We've
> incorporated the feedback into our design and would like to share our
> updated proposal. We stripped it down to the minimum required to realize
> the major benefits, like elimination of loads/stores to temporaries and
> generation of tuned loops for matrix multiplies. We discuss further
> extensions and points that came up during the earlier discussions after the
> new proposal.
>
> TLDR: Our new proposal boils down to adding 4 matrix related intrinsics
> (transpose, multiply, and strided load & store) with shape information and
> an IR pass, that lowers the intrinsics to regular vector operations. The IR
> pass can also be extended to break down operations into chunks suitable for
> specialised matrix ISAs (like the matrix instructions included in ARM v8.6)
> and to reducing spilling. We've designed the approach so it's flexible and
> easy to extend.
> ProposalAfter the initial feedback, we evaluated using ‘flattened’
> vectors to hold matrix values, instead of adding a new matrix type, as
> suggested originally. This was option #1 suggested in [3]
> <http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html>.
>
> With that in mind, we propose to add a set of experimental intrinsics for
> matrix operations that require information about the shape/layout of the
> underlying matrix. The suggested intrinsics take the shape information as
> additional (constant) integer arguments and column major layout is assumed
> initially. With the new proposal, there are very few intrinsics that
> actually care about the memory layout and they can be easily extended to
> also support row-major layouts.
>
> Initially, we would like to propose the following intrinsics:
>
>    - <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32 <N>)
>    Treat %in as containing a matrix with M rows and N columns and
>    transpose it.
>
>    - <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32 <M>, i32 <N>,
>    i32<K>)
>    Treat %X as matrix with M rows and K columns, %Y as matrix with K rows
>    and N columns and multiply them.
>
>    - <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>, i32
>    <N>)
>    Load a matrix with M rows and N columns, using a stride of %Stride
>    between columns. This allows for convenient loading of sub matrixes.
>
>    - void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64
>    %Stride, i32 <M>, i32 <N>)
>    Store a matrix with M rows and N columns, using a stride of %Stride
>    between columns. This allows for convenient storing of sub matrixes.
>
>
> The floating point versions of the intrinsics also take fast-math flags,
> which can be used to opt-in to FMA generation and/or constant folding
> opportunities via NoInfs and NoNaNs. We plan to add them to the lowered
> instructions and rely on InstCombine & Co for related optimisations.
>
> The intrinsics will be lowered to regular LLVM vector operations in a IR
> lowering pass. This means per default, we can lower the builtins on all
> targets. Additionally, targets can implement their own lowering to a
> specialized ISA extension. We implemented such a lowering for an internal
> matrix accelerator.
>
> The example below shows how the matrix_transpose builtin will be lowered
> in the IR lowering pass.
>
> define void @transpose(<8 x double> * %A, <8 x double>* %B) {
> entry:
> %a = load <8 x double>, <8 x double>* %A, align 16
> %b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32
> 4)
> store <8 x double> %c, <8 x double>* %B, align 16
> ret void
> }
> declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)
>
> Will get lowered to something like
>
> define void @transpose(<8 x double> * %A, <8 x double>* %B) {
> entry:
> %0 = bitcast <8 x double>* %A to <2 x double>*
> %1 = load <2 x double>, <2 x double>* %0, align 8
> %2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2
> %3 = bitcast double* %2 to <2 x double>*
> %4 = load <2 x double>, <2 x double>* %3, align 8
> %5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4
> %6 = bitcast double* %5 to <2 x double>*
> %7 = load <2 x double>, <2 x double>* %6, align 8
> %8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x i32> <i32 0,
> i32 1, i32 undef, i32 undef>
> %9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6
> %10 = bitcast double* %9 to <2 x double>*
> %11 = load <2 x double>, <2 x double>* %10, align 8
> %12 = shufflevector <2 x double> %11, <2 x double> undef, <4 x i32> <i32
> 0, i32 1, i32 undef, i32 undef>
> %13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 0,
> i32 2, i32 undef, i32 undef>
> %14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x i32> <i32 0,
> i32 1, i32 4, i32 undef>
> %15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x i32> <i32 0,
> i32 1, i32 2, i32 4>
> %16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 1,
> i32 3, i32 undef, i32 undef>
> %17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x i32> <i32 0,
> i32 1, i32 5, i32 undef>
> %18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x i32> <i32 0,
> i32 1, i32 2, i32 5>
> %19 = bitcast <8 x double>* %C to <4 x double>*
> store <4 x double> %15, <4 x double>* %19, align 8
> %20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64 4
> %21 = bitcast double* %20 to <4 x double>*
> store <4 x double> %18, <4 x double>* %21, align 8
> ret void
> }
>
> Before we do the actual lowering, we propagate the shape information from
> intrinsics to connected instructions. This allows us to improve the code we
> generate for regular IR operations on matrixes embedded in a flattened
> vector. In the example above, we propagate the information that we are
> loading a matrix with 2 rows and 4 columns to `%a = load <8 x double>, <8 x
> double>* %A, align 16` and lower it to a series of `load <2 x double>, <2 x
> double>*`, which helps with avoiding a large number of shufflevector
> instructions to get column vectors. Please note that propagating the shape
> information allows us to improve the code we generate during lowering, but
> is not required for correctness. Without propagating shape information, we
> would just need additional shuffles to extract the rows/columns at the
> point where we lower a matrix multiply for example.
>
> This infrastructure can also be used to improve codegen for large vector
> operations in general. For example, consider loading 2 <100 x double>
> vectors, adding them and storing the result. Currently this will be lowered
> naively doing most of the loads first, then most of the adds and then most
> of the stores, causing plenty of spilling on AArch64. We can significantly
> improve the generated code by breaking it into load/add/store blocks that
> fit into the target registers.
> We also plan to implement some additional optimisations as part of the
> lowering, like generating fused/tiled loops for matrix multiplications and
> direct FMA generation.
>
> As future work, we are also evaluating adding additional operations
> including clamping a vector or matrix, min/max of matrixes, inverting a
> matrix and computing matrix determinates. Some of those might require
> additional intrinsics, which we can discuss on a case by case basis.
>
> In terms of integration, we are planning on adding a separate
> MatrixIRBuilder utility (as suggested in [4]
> <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html>),
> which can be used by frontends to create various matrix operations. For
> point wise operations, they will lower to vector instructions directly.
> Potential Future ExtensionsDuring the previous discussions, a few
> extensions were discussed, but we would like exclude them from the initial
> implementation. Those are
>
>    - N-Dimension
>    Given that the dimensions are represented as arguments to intrinsics,
>    we can go from supporting 2 dimensions to N dimensions by using variadic
>    arguments for shape information and handle it accordingly while lowering.
>    - Padding
>    For small matrixes, it might be desirable to automatically add
>    additional padding to columns/rows, e.g. add 1 element padding to each
>    column in a 3 x 3 matrix, to allow for using vector instructions operating
>    on power-of-2 number of elements or satisfy an alignment requirement by a
>    target. This allows for additional optimizations, but is not required for
>    lowering the intrinsics. We also haven't seen this being an issue so far.
>    We should be able to iterate on that, once it becomes an issue. (Earlier
>    discussion [5]
>    <http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html> )
>
> Support in ClangWe are also planning to expose the matrix support in
> Clang via a matrix type on the C/C++ level (similar to the
> ext_vector_type attribute) together with a set of matrix builtins.
> Similar to the LLVM part, we would like to propose a pragmatic solution for
> common matrix operations, while being flexible enough to allow iterative
> improvements to accommodate additional requirements. The main motivation
> for the matrix support on the Clang side is to give users a way to
>
>    - Guarantee generating high-quality code for matrix operations and
>    trees of matrix operations. For isolated operations, we can guarantee
>    vector code generation suitable for the target. For trees of operations,
>    the proposed value type helps with eliminating temporary loads & stores.
>    - Make use of specialized matrix ISA extensions, like the new matrix
>    instructions in ARM v8.6 or various proprietary matrix accelerators, in
>    their C/C++ code.
>    - Move optimisations from matrix wrapper libraries into the compiler.
>    We use it internally to simplify an Eigen-style matrix library, by relying
>    on LLVM for generating tiled & fused loops for matrix operations.
>
>
> We propose adding a new matrix value type, that can be declared via a
> `matrix_type` attribute. Alternatively we could also generalise the
> existing ext_vector_type attribute ([6]
> <https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors>),
> if that is preferred. Initially, the matrix type supports 2 dimensions
> (rows & columns). For example, a 4x4 float matrix would be declared as
> `typedef float m4x4_t __attribute__((matrix_type(4, 4)));`.
>

Hi Florian,

You can find Clang's policy for accepting language extensions described at
http://clang.llvm.org/get_involved.html

This extension clearly satisfies point 2, and I'm happy to trust that we
can iterate to satisfying points 6 and 7. I would like to hear your
thoughts on the other criteria.

Initially we want to focus on 2D matrixes without padding in column-major
> layout as a concrete use case. But our proposed type can be extended
> naturally to
>
>    - Support N (known constant) dimensions by turning matrix_type
>    attribute into a variadic attribute.
>    - Support column/row-wise padding, by adding a column_padding clause
>    to the attribute.
>    Dealing with the padding could be exclusively handled on the frontend
>    side, by emitting additional shufflevector instructions to extract the
>    data. If there is a desire to exploit the padding more on the LLVM side, we
>    can add a set of intrinsics for that.
>    - Support row & column major layouts, by adding a layout clause to the
>    attribute.
>    Again, this naively could be handled while lowering to LLVM IR in
>    Clang using shufflevector to produce flattened vectors with the required
>    layout. For better optimisations, the LLVM intrinsics relying on
>    shape/layout information can be extended to take the layout as additional
>    argument. Through propagating the layout information similar to the
>    dimensions, we should be able to optimise the points where we need to
>    transform the layout of the underlying matrixes.
>
> In all cases, we require known constants as dimensions and we do not plan
> to support dynamic dimensions for now.
>
> In addition to the type, we propose adding a set of overloaded builtins to
> perform operations on matrix values. To start with, we would like to add
> the following builtins: __builtin_matrix_add, __builtin_matrix_subtract,
> __builtin_matrix_multiply, __builtin_matrix_scalar_multiply,
> __builtin_matrix_transpose, __builtin_matrix_insert,
> __builtin_matrix_extract, __builtin_matrix_column_load,
> __builtin_matrix_column_store. Those builtins allowed us to cover a wide
> range of uses cases internally, but as mentioned on the LLVM side already,
> we are also evaluating adding additional operations, like clamping matrixes
> or getting the minimum/maximum of a matrix. We should be able to iterate on
> the set of operations, as required by actual users.
>
> In terms of LLVM integration, we plan to provide a MatrixIRBuilder (see
> LLVM part), that can lower the various builtins to LLVM IR (in terms of
> matrix builtins or vector operations, depending on the builtin).
>
>
>
> We currently have an internal implementation based on Adam’s original
> proposal ('first class matrix type') and are using that for a large
> internal project in production. We also prototyped the approach outlined in
> this proposal ('flattened matrixes') to make sure we can produce equivalent
> code (and thus performance) with both approaches on a large codebase.
>
> We think predication is out of scope for the initial proposal and the
> right forum to discuss predication support in general are the related
> discussions on the list, the LLVM-VP proposal, [7]
> <https://reviews.llvm.org/D57504> and the round table at the Dev Meeting.
>
> We think our current proposal addresses the concerns raised previously,
> especially the concerns around the high cost of adding a new IR type,
> adding too many new intrinsics and generalising the approach to N
> dimensions. Unless there are any additional major concerns, we should be
> able to share patches for review soon.
>
> Cheers,
> Florian
>
>
> [1] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html
> [2] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128322.html
> [3] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126982.html
> [4] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html
> [5] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128331.html
> [6]
> https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
> [7] https://reviews.llvm.org/D57504
> [8] http://lists.llvm.org/pipermail/llvm-dev/2018-December/128636.html
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191119/ef188824/attachment.html>