[PATCH] D70456: [Matrix] Add first set of matrix intrinsics and initial lowering pass.

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 3 05:52:28 PDT 2020


fhahn marked an inline comment as done.
fhahn added a comment.

Hi Renato,

In D70456#1959243 <https://reviews.llvm.org/D70456#1959243>, @rengolin wrote:

> Hi Florian, cool work, thanks!
>
> I'm wondering how the vectoriser could profit from this.
>
> Currently, your patch is passing the intrinsic lowering pass before the vectoriser, so we'd see the long sequence of insert/extract element that we'd normally see anyway. Ie, it should have no functional difference if you lowered like that from the front-end.


Yes, the lowering as done in this patch could also have been done exclusively by the frontend without functional difference.

However, we have since improved the lowering to propagate shape information from matrix intrinsics to connected operations. This shape information is then used to split up connected instructions (like regular IR loads/stores/binary operators) so they operate on column vectors, eliminating most of the insertelement/extractelement/shuffle instructions previously used to pack/unpack columns from the flat vector. For example, this can be seen in https://github.com/llvm/llvm-project/blob/master/llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll.
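To illustrate the idea (this is my own rough Python model, not the pass's actual implementation; the helper names are hypothetical), shape information lets a binary operator on a flat column-major vector be split into one vector operation per column, instead of packing and unpacking the whole flat vector:

```python
# Rough model of the columnwise lowering: matrices are flat, column-major
# vectors, and known shape lets the pass split a binary operator into
# per-column vector operations.

def split_columns(flat, rows, cols):
    """Split a flat column-major vector into its column vectors."""
    assert len(flat) == rows * cols
    return [flat[c * rows:(c + 1) * rows] for c in range(cols)]

def columnwise_add(a, b, rows, cols):
    """Lower A + B into one vector add per column."""
    a_cols = split_columns(a, rows, cols)
    b_cols = split_columns(b, rows, cols)
    out_cols = [[x + y for x, y in zip(ca, cb)]
                for ca, cb in zip(a_cols, b_cols)]
    # Re-flatten into the flat vector the surrounding IR expects.
    return [v for col in out_cols for v in col]
```

With the shapes known, each column add maps to a single vector instruction of the column's width, which is what removes the insert/extract churn.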

> However, it would perhaps be much easier to do loop tiling directly on the intrinsic, if we knew how to handle them. We could also directly take the vector dimension from the intrinsics to define how many native vector operations are needed (with padding, etc).

The goal is to do exactly that and improve the lowering step by step. The lowering of matrix.multiply already tries to break down the operations according to the vector width of the target (https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L881), but there is a lot of potential for further improvements.
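As a sketch of the decomposition (in Python rather than IR, with hypothetical helper names, and ignoring the per-vector-width splitting the real pass also does), the multiply can be computed one result column at a time, as a sum of scaled columns of the first operand:

```python
def multiply_columnwise(a, b, m, n, k):
    """Model of lowering llvm.matrix.multiply(A, B, m, n, k):
    A is m x n, B is n x k, both flat column-major; the m x k result
    is built column by column. Column j of the result is the sum over
    i of (column i of A) scaled by the scalar B[i][j]."""
    result = []
    for j in range(k):
        col = [0.0] * m
        for i in range(n):
            a_col = a[i * m:(i + 1) * m]   # column i of A
            b_elt = b[j * n + i]           # scalar B[i][j]
            col = [acc + x * b_elt for acc, x in zip(col, a_col)]
        result.extend(col)
    return result
```

Each scaled-column accumulation corresponds to a broadcast multiply plus a vector add, and the pass further splits those column operations to fit the target's native vector width.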

> Do you plan to add support in the vectoriser, and if so, would that reduce / invalidate the need for the current lowering pass?

Currently I am working on adding initial tiling support for multiplies directly to the lowering pass: D75566 <https://reviews.llvm.org/D75566>.

One advantage of doing it in the lowering pass is that all the necessary information is available there and it is very specific to the intrinsic. Without lowering the intrinsic, there is no loop at all. (Even with the proposed tiling, there won't be any loops, as the lowering effectively unrolls the tiled loops; that will be improved in the future, since this approach is not practical for larger matrices.)
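To make the tiling idea concrete, here is a small Python model (my own sketch, not the code in D75566): the result is accumulated tile by tile, and the lowering effectively emits straight-line code for each tile rather than the loops written below.

```python
def tiled_multiply(a, b, m, n, k, tile=2):
    """Model of a tiled m x n by n x k multiply on flat column-major
    vectors: A[i][p] = a[p*m+i], B[p][j] = b[j*n+p], C[i][j] = c[j*m+i].
    The lowering would unroll all of these loops into straight-line IR."""
    c = [0.0] * (m * k)
    for j0 in range(0, k, tile):
        for i0 in range(0, m, tile):
            for p0 in range(0, n, tile):
                # Accumulate one tile of C from a tile of A and a tile of B.
                for j in range(j0, min(j0 + tile, k)):
                    for i in range(i0, min(i0 + tile, m)):
                        for p in range(p0, min(p0 + tile, n)):
                            c[j * m + i] += a[p * m + i] * b[j * n + p]
    return c
```

Keeping the working set of each tile small is what makes the accumulators fit in registers; generating actual loops instead of fully unrolling is the follow-up needed for larger matrices.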

I think currently the general direction of the work is towards making the lowering pass better, rather than teaching other passes about the matrix intrinsics. I've also been thinking about using the infrastructure in the lowering pass to optimize large vector operations, even if no matrix intrinsics are involved. At the moment I am not sure how supporting matrix intrinsics would fit into passes like the loop vectorizer, but the lowering pass might be a good candidate to use VPlan for code generation/cost-modeling, once the infrastructure is there. Another direction to explore would be to detect loops that perform a matrix multiply and replace them with a call to the intrinsic, which then gets further optimized.

Sorry for the somewhat lengthy response, but does the overall direction make sense to you?

Cheers,
Florian



================
Comment at: llvm/docs/LangRef.rst:14451
+
+      declare vectorty @llvm.matrix.multiply.*(vectorty %A, vectorty %B, i32 <M>, i32 <N>, i32 <K>)
+
----------------
LuoYuanke wrote:
> fhahn wrote:
> > LuoYuanke wrote:
> > > fhahn wrote:
> > > > LuoYuanke wrote:
> > > > > I have a question for the shape propagation. What if there is some conflict of the shape.  Take below code for example. The shape of matrix A is defined both by load and multiply. What is shape of matrix C and D? 4x4 or 5x5?
> > > > > 
> > > > > ```
> > > > > A = llvm.matrix.columnwise.load(ptr, stride, 5, 5);
> > > > > B = llvm.matrix.columnwise.load(ptr, stride, 5, 5);
> > > > > C = A + B
> > > > > llvm.matrix.multiply(A, B, 4, 4, 4);
> > > > > llvm.matrix.multiply(A, B, 5, 5 ,5);
> > > > > D = A - B
> > > > > 
> > > > > ```
> > > > I think the example above is not entirely valid as the loaded values are of type <25 x double>  and the first multiply would take arguments with <16 x double>. I've put up D77129 to update the verifier to reject such IR.
> > > > 
> > > >  However a conflict between shapes can occur, if the number of elements matches, as in the example below. In cases where the shape information inferred for an operand does not match the shape required at a use, we embed the operand into a flat vector and extract vectors of the required shape for that use (it is done in getMatrix  https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L331)
> > > > 
> > > > ```
> > > > A = llvm.matrix.columnwise.load(ptr, stride, 2, 3); // Shape of A = 2 x 3
> > > > llvm.matrix.multiply(A, A, 2, 3, 2); // first operand requires  2 x 3 - use A directly
> > > >                                      // second operand requires  3 x 2 - cannot use A directly; embedded A in flat vector and extract columns assuming 3 x 2.
> > > > ```
> > > > 
> > > > The specification of the intrinsics does not mention shape propagation and the shape imposed by the intrinsics only applies to the uses at the intrinsic calls. Shape propagation is only used for optimization purposes to better split up instructions that do not require shape info.
> > > > 
> > > > Does that make sense? It might be desirable to add an option to check for conflicts.
> > > I prefer to be more strict for the syntax.  The example that you provide is also illegal to me. The matrix should be fixed when matrix is defined.  Maybe we can have some reshape operator to explicitly reshape the matrix.
> > > 
> > > 
> > > ```
> > > C = llvm.matrix.reshape(A, row, col);
> > > ```
> > > 
> > > ```
> > > A = llvm.matrix.columnwise.load(ptr, stride, 2, 3); // Shape of A = 2 x 3
> > > llvm.matrix.multiply(A, A, 2, 3, 2); // first operand requires  2 x 3 - use A directly
> > >                                                      // second operand requires  3 x 2 - illegal.
> > > ```
> > > I prefer to be more strict for the syntax. The example that you provide is also illegal to me. The matrix should be fixed when matrix is defined. Maybe we can have some reshape operator to explicitly reshape the matrix.
> > 
> > Unfortunately I think without a matrix type on the IR level, such a stricter syntax won't really be feasible. The intrinsics operate on arbitrary IR vectors and the operands/results can also be used by other instructions. At the moment, the matrix intrinsics and regular IR instructions interact nicely, I think exactly because they operate exclusively on flat vectors, with the intrinsics having some additional information. But the intrinsics themselves produce a flat vector again. This keeps the spec simple and also has the advantage of giving the optimizer freedom to do shape related optimizations.
> > 
> > A restriction as you mentioned would have to be enforced on the frontend side I think. For example, there should be no shape mismatches generated by Clang with our proposal/patches.
> > 
> > It might be good to add an option to the lowering pass to turn shape mismatches into error to verify a frontend does indeed not introduce conflicts. 
> > 
> > What do you think?
> It is nice that frontend can prevent such shape mismatch from happening. I'd like to see what the c/c++ looks like. Do you have any example code? It is good if the example code can cover all the matrix intrinsics that you proposed.
> About the option, I wonder if middle-end can report error elegantly like front-end without any abort or core dump.
> BTW, I'd like to explore the dynamic shape of Matrix. Do you have any ideas how to support dynamic shape of Matrix?
> It is nice that frontend can prevent such shape mismatch from happening. I'd like to see what the c/c++ looks like. Do you have any example code? It is good if the example code can cover all the matrix intrinsics that you proposed.

I have put up the latest version of the proposed spec for the C/C++ extensions on Phabricator: D76612. The actual clang patches to implement the proposal are also available, starting at D72281 (and the linked patches).
It introduces a matrix type on the C/C++ level and the matrix types must agree for all operands of operations. There is no syntax/builtin provided for converting between different shapes, other than going through memory or moving elements to a matrix value with the desired shape.

> About the option, I wonder if middle-end can report error elegantly like front-end without any abort or core dump.
No, I don't think so. It would mainly be useful during testing, to verify that a frontend does not introduce conflicts.

> BTW, I'd like to explore the dynamic shape of Matrix. Do you have any ideas how to support dynamic shape of Matrix?
Interesting. I think that should ideally be driven by concrete use cases, and I would like to hear more! Email or some other medium might be slightly better suited to discuss the topic, though.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70456/new/

https://reviews.llvm.org/D70456




