[llvm-dev] [cfe-dev] RFC: First-class Matrix type

Adam Nemet via llvm-dev llvm-dev at lists.llvm.org
Fri Oct 12 09:32:11 PDT 2018



> On Oct 11, 2018, at 7:22 PM, Finkel, Hal J. <hfinkel at anl.gov> wrote:
> 
> 
> On 10/11/2018 08:29 PM, Adam Nemet via cfe-dev wrote:
>> 
>> 
>>> On Oct 11, 2018, at 3:42 PM, Richard Smith <richard at metafoo.co.uk <mailto:richard at metafoo.co.uk>> wrote:
>>> 
>>> On Wed, 10 Oct 2018 at 23:10, Adam Nemet via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>>> Hi,
>>> 
>>> We are proposing first-class type support for a new matrix type.  This is a natural extension of the current vector type with an extra dimension.
>>> For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:
>>> 
>>> %0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
>>> %1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
>>> %2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
>>> store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
>>> 
>>> Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, extract/insert of an element.  Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory.  We are also planning to implement vector-extract/insert and matrix-vector multiply.
>>> 
>>> All of these are currently implemented as intrinsics.  Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).
>>> 
>>> These are exposed in clang via builtins.  E.g. the above operations looks like this in C/C++:
>>> 
>>> typedef float mf4x4_t __attribute__((matrix_type(4, 4)));
>>> 
>>> mf4x4_t add(mf4x4_t a, mf4x4_t b) {
>>>   return __builtin_matrix_multiply(a, b);
>>> }
>>> 
>>> This seems to me to be proposing two distinct things -- built-in matrix support in the frontend as a language extension, and support for matrices in LLVM IR -- and I think it makes sense to discuss them at least somewhat separately.
>>> 
>>> 
>>> On the Clang side: our general policy for frontend language extensions is described here: http://clang.llvm.org/get_involved.html <http://clang.llvm.org/get_involved.html>
>>> 
>>> I'm totally happy to assume that you can cover most of those points, but point 4 seems likely to be a potential sticking point. Have you talked to WG14 about adding a matrix extension to C? (I'd note that they don't even have a vector language extension yet, but as noted on that page, we should be driving the relevant standards, not diverging from them, so perhaps we should be pushing for that too). Have you talked to the people working on adding vector types to C++ about this (in particular, Matthias Kretz and Tim Shen; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf> and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf> for current / prior work in this area)?
>> 
>> Yes, we’re certainly interested pushing this into the appropriate language standards.  As I was saying in the proposal, right now this is experimental.  We see great performance improvement with this but we’d like to get feedback and data from the wider community and also continue to improve the implementation.  Also the surface area for this extension is pretty minimal and piggybacks vector.  As you’ve seen in the thread there is a diverse set of people from GPU to CPU expressing interests and preferences.  My goal was to put this out there and have people collaborate and then propose what’s actually working.  Obviously ABI is a big consideration here and I don’t want to design in a vacuum.  Would this model work for you?
>> 
>> I also strongly believe that the right approach for good matrix performance is gradual lowering rather than lowering all the way to loops and then trying to recover the original intent of the program which is what the above approach takes.
>> 
>>> On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.
>> 
>> Since people already spoke up about direct HW needs, you’re probably satisfied here but as I said there are other benefits.  For a chain of matrix operations it is very beneficial to fuse the operations and then have it operate on tiles of the matrices at a time.  You don’t need to go to very large matrices at all to start to hit this.
> 
> Can you please write more about the cases you're seeing where this provides benefits? What kinds of InstCombine(-like) rules have you implemented to go along with this?

Sure.  We have a case for example where the result of a matrix multiply is in-place added to a submatrix.  The intermediate matrix values can’t fit in the vector registers but even if they could using the minimal number of registers is very beneficial to provide more register renaming bandwidth to the OOO cores.  So we just load enough from the input matrices to compute a single native vector register-size chunk of the multiply, add the matching chunk from the submatrix and then store it back.  What the optimal input chunks are depends on whether the inputs are transposed or not which where the InstCombine-like rules come in.

This case takes advantage of another semantical rule with matrices that the new representation allows.  Because we know we are loading from the same submatrix that we store to we know that the different columns between the load and the store can’t alias which allows the loads and the stores to be reordered in order to enable the fusion above.

> Is this something that we could get if we had better loop transformations (i.e., loop fusion)?

As we both know, aliasing usually makes this pretty difficult.  For relatively small matrices run-time memchecks can be costly.  Also we need to be able to evaluate these fusion opportunities in the inliner so having first-class type representation makes this cheap.

Adam

> 
> Thanks again,
> Hal
> 
>> 
>>> However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)
>> 
>> That’s hard for me to answer as I didn’t look at complex-number performance.  
>> 
>>> How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?
>> 
>> As you say, complex numbers are not a first-class type so this is not supported just like it’s not for vectors, i.e. you’d have to use an array of complex numbers.  That said this may tip the scales to introduce complex numbers as well.
>> 
>> Adam
>> 
>>> 
>>> ** Benefits **
>>> 
>>> Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations.  But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors.  Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level which is the usual sticking point with adopting FP-contraction.
>>> 
>>> Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end.  Matrices have the unique requirement that both rows and columns need to be accessed efficiently.  By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).
>>> 
>>> The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns.  This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).
>>> 
>>> The layout could also be made up of tiles/submatrices of the matrix.  This is an important case for us to fight register pressure.  Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix.  Since this transformation reorders memory operations, we may also need to emit run-time alias checks.
>>> 
>>> Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrixes.  This is a pretty rich area for matrices.  It includes whether the matrix is stored row-major or column-major order.  Whether there is padding between rows/columns.  When and how matrices are passed in argument registers.  Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.
>>> 
>>> Having all this knowledge at the IR level means that front-ends are able to share the complexities of the implementation.  They just map their matrix type to the IR type and the builtins to intrinsics.
>>> 
>>> At Apple, we also need to support interoperability between row-major and column-major layout.  Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them.  Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.
>>> 
>>> ** ABI **
>>> 
>>> We currently default to column-major order with no padding between the columns in memory.  We have plans to also support row-major order and we would probably have to support padding at some point for targets where unaligned accesses are slow.  In order to make the IR self-contained I am planning to make the defaults explicit in the DataLayout string.
>>> 
>>> For function parameters and return values, matrices are currently placed in memory.  Moving forward, we should pass small matrices in vector registers.  Treating matrices as structures of vectors seems a natural default.   This works well for AArch64, since Homogenous Short-Vector Aggregates (HVA) can use all 8 SIMD argument registers.  Thus we could pass for example two 4 x 4 x float matrices in registers.  However on X86, we can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float matrices.
>>> 
>>> Alternatively, we could treat a matrix as if its rows/columns were passed as separate vector arguments.  This would allow using all 8 vector argument registers on X86 too.
>>> 
>>> Alignment of the matrix type is the same as the alignment of its first row/column vector.  
>>> 
>>> ** Flow **
>>> 
>>> Clang at this point mostly just forwards everything to LLVM.  Then in LLVM, we have an IR function pass that lowers the matrices to target-supported vectors.  As with vectors, matrices can be of any static size with any of the primitive types as the element type.
>>> 
>>> After the lowering pass, we only have matrix function arguments and instructions building up and splitting matrix values from and to vectors.  CodeGen then lowers the arguments and forwards the vector values.  CodeGen is already capable of further lowering vectors of any size to scalars if the target does not support vectors.
>>> 
>>> The lowering pass is also run at -O0 rather than legitimizing the matrix type during CodeGen like it’s done for structure values or invalid vectors.  I don’t really see a big value of duplicating this logic across the IR and CodeGen.  We just need a lighter mode in the pass at -O0.
>>> 
>>> ** Roll-out and Maintenance **
>>> 
>>> Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type.  ABI and intrinsic compatibility won’t be guaranteed initially until we lift the experimental status.
>>> 
>>> We are obviously interested in maintaining and improving this code in the future.
>>> 
>>> Looking forward to comments and suggestions.
>>> 
>>> Thanks,
>>> Adam
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
>> 
>> 
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
> 
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20181012/c8557497/attachment-0001.html>


More information about the llvm-dev mailing list