[cfe-dev] RFC: First-class Matrix type

Tue Oct 16 11:12:17 PDT 2018

On Oct 10, 2018, at 11:09 PM, Adam Nemet via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> Hi,
> 
> We are proposing first-class type support for a new matrix type.  

Interesting!  Here are some thoughts.  I’m sorry, but I haven’t read the responses downthread.

> This is a natural extension of the current vector type with an extra dimension.
> For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:
> 
> %0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
> %1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
> %2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
> store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16

LLVM already has a pretty general vector type (arbitrary number of elements).  I’m aware of hardware that has rectangular vectors: e.g. NVIDIA tensor cores, and Google has a proprietary in-house design with non-square vector registers.

> Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, extract/insert of an element.  Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory.  We are also planning to implement vector-extract/insert and matrix-vector multiply.
> 
> All of these are currently implemented as intrinsics.  Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).

Ok.  I agree that supporting the existing pointwise vector operations makes sense.

> These are exposed in clang via builtins.  E.g. the above operations looks like this in C/C++:
> 
> typedef float mf4x4_t __attribute__((matrix_type(4, 4)));
> 
> mf4x4_t add(mf4x4_t a, mf4x4_t b) {
>   return __builtin_matrix_multiply(a, b);
> }

I’d recommend splitting the clang discussion from the LLVM discussion; there are completely different tradeoffs involved.  I’ll focus on the LLVM IR side of things.

> ** Benefits **
> 
> Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations.  But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors.  

Right, it is basically the same benefit as having a vector type.  You also get the ability to have specific alignments etc.

I think there are several options in the design space here:

1. Do nothing to the type system, but just use the existing vector types (<16 x float> in this case) with a new set of operations.
2. Introduce a “matrix” concept and associated operations.
3. Introduce N-dimensional vector registers and associated operations.

Personally, I’d be in favor of going with #1 followed by #3 followed distantly by #2.

The reason I’m opposed to a matrix *type* is that this is far too specific a concept to put into LLVM.  We don’t even have signedness of integers in the type system: the instruction set is the major load-bearing part of the IR design, and the instruction set is extensible through intrinsics.

Arguing in favor of #1: AFAICT, you only need to add the new intrinsics to do matmul etc.  You could just define them to take 1D vectors but apply math to them that interprets them as a 2D space.  This is completely an IR level modeling issue, and would be a very non-invasive patch.  You’re literally just adding a few intrinsics.  All the pointwise operations and insert/extract/shuffles will “just work”.  The frontend handles mapping 2D indices to 1D indices.
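
To make option #1 concrete, here is a minimal C++ sketch of the idea: the matrix lives in a flat 1D array (standing in for something like <16 x float>), and the multiply routine alone interprets it as 2D.  The names flatIndex and matmulFlat are hypothetical, and the row-major layout is an assumption for illustration; none of this comes from the proposal or from LLVM itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: the 2D-to-1D index mapping a frontend would do,
// assuming row-major order.
inline std::size_t flatIndex(std::size_t row, std::size_t col,
                             std::size_t numCols) {
  assert(col < numCols && "column index out of range");
  return row * numCols + col;
}

// Models what a matmul intrinsic over flat vectors would compute.
// A is M x K, B is K x N, the result is M x N; all three are plain 1D arrays.
template <std::size_t M, std::size_t K, std::size_t N>
std::array<float, M * N> matmulFlat(const std::array<float, M * K> &a,
                                    const std::array<float, K * N> &b) {
  std::array<float, M * N> c{};
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < K; ++k)
        sum += a[flatIndex(i, k, K)] * b[flatIndex(k, j, N)];
      c[flatIndex(i, j, N)] = sum;
    }
  }
  return c;
}
```

The dimensions have to be spelled out at the call site, e.g. matmulFlat<4, 4, 4>(a, b), because they cannot be recovered from the flat type alone.  That is the same information an intrinsic over 1D vectors would need to carry in its name or operands.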

If you are interested in expanding the type system, I think that #3 is the way to go: extend VectorType to support multiple constant dimensions, like <4 x 32 x float>.  I think that all of the existing vector intrinsics and instructions are still defined (pointwise).  You’d then add matrix specific intrinsics like your matmul above.  If you are going to extend the type system, here are things to think about:

1) You will need to introduce legalization infra to lower all valid (as in, passing the verifier) cases to arbitrary targets, e.g. turning Nd elementwise add operations into <4 x float> adds or scalar operations.  Legalization has a lot of the infra to do this, but it would be a big project.

2) Extending the type system has non-local effects because you have to audit (e.g.) the mid level optimizer to make sure that everything touching vector type, or “add” is doing the right thing for the new case.

3) Extending the IR like this is plausible, but would need to be carefully considered by the community.
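
As a rough illustration of the legalization work in point 1, the C++ sketch below splits an elementwise add over a Rows x Cols value into fixed-width chunks plus a scalar tail.  The native width of 4, and the names addNative4 and legalizedAdd, are assumptions made up for this example; real legalization operates on the IR or SelectionDAG and has many more cases to handle.

```cpp
#include <array>
#include <cstddef>

// Assumed native vector width of the hypothetical target.
constexpr std::size_t kNativeWidth = 4;

// Stand-in for a single native <4 x float> fadd.
inline void addNative4(const float *a, const float *b, float *out) {
  for (std::size_t lane = 0; lane < kNativeWidth; ++lane)
    out[lane] = a[lane] + b[lane];
}

// Elementwise add of a Rows x Cols value, lowered to as many native-width
// adds as fit, followed by scalar adds for the leftover elements.
template <std::size_t Rows, std::size_t Cols>
std::array<float, Rows * Cols>
legalizedAdd(const std::array<float, Rows * Cols> &a,
             const std::array<float, Rows * Cols> &b) {
  constexpr std::size_t total = Rows * Cols;
  std::array<float, total> out{};
  std::size_t i = 0;
  for (; i + kNativeWidth <= total; i += kNativeWidth)
    addNative4(a.data() + i, b.data() + i, out.data() + i);
  for (; i < total; ++i) // scalar tail
    out[i] = a[i] + b[i];
  return out;
}
```

For a 3x3 value this produces two native adds and one scalar add.  Because the operation is pointwise, the dimensions only matter through the total element count, which is why the existing 1D machinery could plausibly be reused.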

> Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level which is the usual sticking point with adopting FP-contraction.

Just to nit-pick here: I’d suggest following the existing precedent of the IR in terms of contraction etc.  You shouldn’t assume that all matmul-adds are contractable just because your current client wants that.  Consistency with the rest of the IR is far more important and we would WANT other clients to use this someday.

> Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end.  Matrices have the unique requirement that both rows and columns need to be accessed efficiently.  By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).

I don’t understand this point at all.

> The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns.  This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).

LLVM already has alignment support for memory objects, and the frontend can handle complex lowering like turning a 3x3 matrix into an LLVM IR <16 x float> (i.e. a 4x4 matrix) if that is more efficient for some reason.  If there is a design to put this logic into the code generator and make it frontend independent etc, then putting it in the LLVM IR type system is reasonable, so long as you “do it right” and implement full support for this (including legalization for all targets etc).
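
The frontend-side lowering mentioned above could look roughly like the following C++ sketch, which widens a tightly packed 3x3 matrix to a 16-element buffer with a row stride of 4 so that each row starts on a 4-float boundary.  The function names, the row-major order, and the zero-filled padding are assumptions for illustration only.

```cpp
#include <array>
#include <cstddef>

// Copies a packed row-major 3x3 matrix into a 4x4-shaped buffer.
// Padding lanes (the fourth column and the fourth row) are zero.
inline std::array<float, 16> padTo4x4(const std::array<float, 9> &m) {
  std::array<float, 16> out{};
  for (std::size_t row = 0; row < 3; ++row)
    for (std::size_t col = 0; col < 3; ++col)
      out[row * 4 + col] = m[row * 3 + col];
  return out;
}

// Inverse mapping: drops the padding lanes again.
inline std::array<float, 9> unpadFrom4x4(const std::array<float, 16> &p) {
  std::array<float, 9> out{};
  for (std::size_t row = 0; row < 3; ++row)
    for (std::size_t col = 0; col < 3; ++col)
      out[row * 3 + col] = p[row * 4 + col];
  return out;
}
```

With zero padding, pointwise adds and multiplies on the padded form leave the padding lanes at zero, so they can run directly on the 16-element vector.  Operations such as division would need the padding lanes handled explicitly.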

> The layout could also be made up of tiles/submatrices of the matrix.  This is an important case for us to fight register pressure.  Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix.  Since this transformation reorders memory operations, we may also need to emit run-time alias checks.

LLVM isn’t really set up to do these sorts of optimizations on the IR: it treats SSA values as units, and tiling within an ND vector will be challenging.  If you're interested in tackling this, it would be great to start doing it for 1D vectors, which have exactly the same issue: we generate atrocious code for chains of <256 x float> operations on machines that support <4 x float> because of register pressure etc.
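
The 1D version of the problem can be sketched in C++ as follows.  The first function evaluates a two-operation chain one whole 256-element operation at a time, which keeps a full-width temporary live between the operations; the second evaluates the whole chain one 4-element tile at a time, so only a tile-sized temporary is live.  The chain (a + b) * c, the tile width of 4, and all names are assumptions chosen for illustration; this is not how LLVM currently lowers such code.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLen = 256; // models <256 x float>
constexpr std::size_t kTile = 4;  // models a <4 x float> register
using Vec = std::array<float, kLen>;

// Operation-at-a-time: all 256 elements of tmp are live between the two ops.
inline Vec wholeVector(const Vec &a, const Vec &b, const Vec &c) {
  Vec tmp{};
  for (std::size_t i = 0; i < kLen; ++i)
    tmp[i] = a[i] + b[i];
  Vec out{};
  for (std::size_t i = 0; i < kLen; ++i)
    out[i] = tmp[i] * c[i];
  return out;
}

// Tile-at-a-time: the chain is run to completion on each 4-element tile,
// so only kTile intermediate values are live at once.
inline Vec tiled(const Vec &a, const Vec &b, const Vec &c) {
  Vec out{};
  for (std::size_t base = 0; base < kLen; base += kTile) {
    float tmp[kTile];
    for (std::size_t lane = 0; lane < kTile; ++lane)
      tmp[lane] = a[base + lane] + b[base + lane];
    for (std::size_t lane = 0; lane < kTile; ++lane)
      out[base + lane] = tmp[lane] * c[base + lane];
  }
  return out;
}
```

Both functions perform the same per-element arithmetic; only the order in which elements are visited differs.  For pointwise chains that reordering is straightforward, whereas a matmul mixes elements across tiles, which is part of why the ND case is harder.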

> Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrixes.  This is a pretty rich area for matrices.  It includes whether the matrix is stored row-major or column-major order.  Whether there is padding between rows/columns.  When and how matrices are passed in argument registers.  Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.

There are other ways to model this.  You can just add a new llvm parameter attribute if necessary.

> Having all this knowledge at the IR level means that front-ends are able to share the complexities of the implementation.  They just map their matrix type to the IR type and the builtins to intrinsics.

Agreed, but this only happens through a significant investment in making the mid-level IR general.  I’m not opposed to this, but this is a significant engineering task.

> At Apple, we also need to support interoperability between row-major and column-major layout.  Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them.  Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.

I don’t see how your proposal helps with this.  How do you represent layout in the IR?

> ** Roll-out and Maintenance **
> 
> Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type.  ABI and intrinsic compatibility won’t be guaranteed initially until we lift the experimental status.

I think it is really important to understand the design and get buy-in from many folks before starting to land patches.  I’d also love to chat about this at the devmtg if you’re available.

-Chris
