[cfe-dev] RFC: First-class Matrix type

Thu Oct 11 19:24:26 PDT 2018

On Thu, 11 Oct 2018 at 18:29, Adam Nemet via cfe-dev <cfe-dev at lists.llvm.org>
wrote:

> On Oct 11, 2018, at 3:42 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>
> On Wed, 10 Oct 2018 at 23:10, Adam Nemet via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> Hi,
>>
>> We are proposing first-class type support for a new matrix type.  This is
>> a natural extension of the current vector type with an extra dimension.
>> For example, this is what the IR for a matrix multiply would look like
>> for a 4x4 matrix with element type float:
>>
>> %0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
>> %1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
>> %2 = call <4 x 4 x float>
>> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x
>> float> %1)
>> store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
>>
>>
>> Currently we support element-wise binary operations, matrix multiply,
>> matrix-scalar multiply, matrix transpose, extract/insert of an element.
>> Besides the regular full-matrix load and store, we also support loading and
>> storing a matrix as a submatrix of a larger matrix in memory.  We are also
>> planning to implement vector-extract/insert and matrix-vector multiply.
>>
>> All of these are currently implemented as intrinsics.  Where applicable
>> we also plan to support these operations with native IR instructions (e.g.
>> add/fadd).
>>
>> These are exposed in clang via builtins.  E.g. the above operations looks
>> like this in C/C++:
>>
>> typedef float mf4x4_t __attribute__((matrix_type(4, 4)));
>>
>> mf4x4_t add(mf4x4_t a, mf4x4_t b) {
>>   return __builtin_matrix_multiply(a, b);
>> }
>>
>>
> This seems to me to be proposing two distinct things -- built-in matrix
> support in the frontend as a language extension, and support for matrices
> in LLVM IR -- and I think it makes sense to discuss them at least somewhat
> separately.
>
>
> On the Clang side: our general policy for frontend language extensions is
> described here: http://clang.llvm.org/get_involved.html
>
> I'm totally happy to assume that you can cover most of those points, but
> point 4 seems likely to be a potential sticking point. Have you talked to
> WG14 about adding a matrix extension to C? (I'd note that they don't even
> have a vector language extension yet, but as noted on that page, we should
> be driving the relevant standards, not diverging from them, so perhaps we
> should be pushing for that too). Have you talked to the people working on
> adding vector types to C++ about this (in particular, Matthias Kretz and
> Tim Shen; see
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf and
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf for
> current / prior work in this area)?
>
>
> Yes, we’re certainly interested pushing this into the appropriate language
> standards.  As I was saying in the proposal, right now this is
> experimental.  We see great performance improvement with this but we’d like
> to get feedback and data from the wider community and also continue to
> improve the implementation.  Also the surface area for this extension is
> pretty minimal and piggybacks vector.  As you’ve seen in the thread there
> is a diverse set of people from GPU to CPU expressing interests and
> preferences.  My goal was to put this out there and have people collaborate
> and then propose what’s actually working.  Obviously ABI is a big
> consideration here and I don’t want to design in a vacuum.  Would this
> model work for you?
>

Yes, I think that's OK, but I also think you should start talking to the
relevant people from WG14 and WG21 early (ie, right now) so that they can
provide feedback to you and vice versa about the needs of the most likely
eventual language model and of the optimizer. We don't want to find in a
year or two that this somehow fundamentally fails to satisfy the frontend
language needs, or conversely that WG21 has standardized something that's
problematic to lower to our optimizable primitives.

If you've not done so already, it'd be a good idea to also talk to the
Eigen folks to make sure that what you're adding is something they could
use in the cases where it's applicable.

I also strongly believe that the right approach for good matrix performance
> is gradual lowering rather than lowering all the way to loops and then
> trying to recover the original intent of the program which is what the
> above approach takes.
>
> On the LLVM IR side, I'm personally unconvinced that we should model
> matrices in the IR directly as a new first-class type, unless there's some
> target out there that has matrix operations in hardware / matrix registers,
> but IR is not really my area of expertise so give that opinion as much or
> little weight as you see fit.
>
>
> Since people already spoke up about direct HW needs, you’re probably
> satisfied here
>

Yes, I am.

> but as I said there are other benefits.  For a chain of matrix operations
> it is very beneficial to fuse the operations and then have it operate on
> tiles of the matrices at a time.  You don’t need to go to very large
> matrices at all to start to hit this.
>

Well, modern C++ matrix libraries deal with similar issues using expression
templates, but such an approach is certainly not without drawbacks.

> However, I do wonder: how is this different from, say, complex numbers,
> for which we don't have a native IR representation? (Maybe the answer is
> that we should have native IR support for complex numbers too.)
>
>
> That’s hard for me to answer as I didn’t look at complex-number
> performance.
>
> How would you expect the frontend to lower (eg) matrix multiplication for
> matrices of complex numbers?
>
>
> As you say, complex numbers are not a first-class type so this is not
> supported just like it’s not for vectors, i.e. you’d have to use an array
> of complex numbers.  That said this may tip the scales to introduce complex
> numbers as well.
>

This seems worth thinking about early, even if you choose to defer
implementing anything in this space.

> Adam
>
>
> ** Benefits **
>>
>> Having matrices represented as IR values allows for the usual algebraic
>> and redundancy optimizations.  But most importantly, by lifting memory
>> aliasing concerns, we can guarantee vectorization to target-specific
>> vectors.  Having a matrix-multiply intrinsic also allows using FMA
>> regardless of the optimization level which is the usual sticking point with
>> adopting FP-contraction.
>>
>> Adding a new dedicated first-class type has several advantages over
>> mapping them directly to existing IR types like vectors in the front end.
>> Matrices have the unique requirement that both rows and columns need to be
>> accessed efficiently.  By retaining the original type, we can analyze
>> entire chains of operations (after inlining) and determine the most
>> efficient *intermediate layout* for the matrix values between ABI
>> observable points (calls, memory operations).
>>
>> The resulting intermediate layout could be something like a single vector
>> spanning the entire matrix or a set of vectors and scalars representing
>> individual rows/columns.  This is beneficial for example because
>> rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3
>> matrix).
>>
>> The layout could also be made up of tiles/submatrices of the matrix.
>> This is an important case for us to fight register pressure.  Rather than
>> loading entire matrices into registers it lets us load only parts of the
>> input matrix at a time in order to compute some part of the output matrix.
>> Since this transformation reorders memory operations, we may also need to
>> emit run-time alias checks.
>>
>> Having a dedicated first-class type also allows for dedicated
>> target-specific *ABIs* for matrixes.  This is a pretty rich area for
>> matrices.  It includes whether the matrix is stored row-major or
>> column-major order.  Whether there is padding between rows/columns.  When
>> and how matrices are passed in argument registers.  Providing flexibility
>> on the ABI side was critical for the adoption of the new type at Apple.
>>
>> Having all this knowledge at the IR level means that *front-ends* are
>> able to share the complexities of the implementation.  They just map their
>> matrix type to the IR type and the builtins to intrinsics.
>>
>> At Apple, we also need to support *interoperability* between row-major
>> and column-major layout.  Since conversion between the two layouts is
>> costly, they should be separate types requiring explicit instructions to
>> convert between them.  Extending the new type to include the order makes
>> tracking the format easy and allows finding optimal conversion points.
>>
>> ** ABI **
>>
>> We currently default to column-major order with no padding between the
>> columns in memory.  We have plans to also support row-major order and we
>> would probably have to support padding at some point for targets where
>> unaligned accesses are slow.  In order to make the IR self-contained I am
>> planning to make the defaults explicit in the DataLayout string.
>>
>> For function parameters and return values, matrices are currently placed
>> in memory.  Moving forward, we should pass small matrices in vector
>> registers.  Treating matrices as structures of vectors seems a natural
>> default.   This works well for AArch64, since Homogenous Short-Vector
>> Aggregates (HVA) can use all 8 SIMD argument registers.  Thus we could pass
>> for example two 4 x 4 x float matrices in registers.  However on X86, we
>> can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float
>> matrices.
>>
>> Alternatively, we could treat a matrix as if its rows/columns were passed
>> as separate vector arguments.  This would allow using all 8 vector argument
>> registers on X86 too.
>>
>> Alignment of the matrix type is the same as the alignment of its first
>> row/column vector.
>>
>> ** Flow **
>>
>> Clang at this point mostly just forwards everything to LLVM.  Then in
>> LLVM, we have an IR function pass that lowers the matrices to
>> target-supported vectors.  As with vectors, matrices can be of any static
>> size with any of the primitive types as the element type.
>>
>> After the lowering pass, we only have matrix function arguments and
>> instructions building up and splitting matrix values from and to vectors.
>> CodeGen then lowers the arguments and forwards the vector values.  CodeGen
>> is already capable of further lowering vectors of any size to scalars if
>> the target does not support vectors.
>>
>> The lowering pass is also run at -O0 rather than legitimizing the matrix
>> type during CodeGen like it’s done for structure values or invalid
>> vectors.  I don’t really see a big value of duplicating this logic across
>> the IR and CodeGen.  We just need a lighter mode in the pass at -O0.
>>
>> ** Roll-out and Maintenance **
>>
>> Since this will be experimental for some time, I am planning to put this
>> behind a flag: -fenable-experimental-matrix-type.  ABI and intrinsic
>> compatibility won’t be guaranteed initially until we lift the experimental
>> status.
>>
>> We are obviously interested in maintaining and improving this code in the
>> future.
>>
>> Looking forward to comments and suggestions.
>>
>> Thanks,
>> Adam
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20181011/1cd5cc92/attachment.html>