[cfe-dev] RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Mon Jul 27 10:28:03 PDT 2020

On 7/20/20 11:13 PM, Baptiste Saleil wrote:
>
>
> On Mon, 22 Jun 2020 at 19:01, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
>     On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:
>>     Summary
>>     -------
>>
>>     New Power ISA v3.1 [0] introduces instructions to accelerate matrix
>>     multiplication. We want to expose these instructions through a list
>>     of target-dependent builtins and new Clang types in the form of a
>>     language extension. This RFC gives more details on the requirements
>>     for these types and explains how we (IBM) are implementing them in
>>     Clang.
>>
>>     We present the frontend implementation as an RFC because we need to
>>     add target-specific checks in Sema and want to get feedback on our
>>     implementation of these checks. The backend implementation does not
>>     impact the other targets so it is not part of this RFC. Comments and
>>     questions are welcome.
>>
>>     Introduction
>>     ------------
>>
>>     The new instructions manipulate matrices that the CPU represents by
>>     new 512-bit registers called `accumulators`. Copying matrices,
>>     modifying values and extracting values of matrices may cause the CPU
>>     to copy values from/to the matrix multiplication unit. To avoid
>>     degrading performance, we thus want to minimize the number of times
>>     these operations are used. So the user will be able to modify and
>>     extract values of the matrices and perform computations with them by
>>     using the dedicated builtins only. The instructions are designed to
>>     be used in computational kernels and we want to enforce that
>>     specific workflow.
>>
>>     Because of this restriction, we cannot rely on the
>>     target-independent matrix types [1].
>
>
>     If this is part of the documented system ABI, and what will be
>     supported by GCC, then we should support it too.
>
>     That having been said, I'm not convinced that this is a good idea,
>     and supporting the target-independent matrix types would be
>     better. I understand that the copying will be expensive, and is
>     something that should be avoided, but this is true to some extent
>     for everything: there are some usages that compile to machine code
>     efficiently and some that don't. We generally, however, favor the
>     ability to create abstractions that *can* be compiled efficiently
>     as part of expected use cases, even if we cannot guarantee that
>     all uses will produce efficient code. In this case, you're
>     prohibiting the creation of abstractions (by semantically
>     restricting to local variables) because you fear that not all uses
>     will compile to efficient code. Are there some other structural
>     reasons why supporting these as regular values would be problematic?
>
> Supporting these as regular values would be problematic for several 
> reasons. These new accumulator registers are actually each associated 
> with 4 of the existing 128-bit VSR vector registers. A particularity 
> of MMA is that when an accumulator contains defined data, its 4 
> associated registers contain undefined data and cannot be used. When 
> copying an accumulator, we need to:
>   1. Copy its value back to its four associated VSRs
>   2. Copy these 4 VSRs to the VSRs associated with the destination 
> accumulator
>   3. Copy these VSRs to the destination accumulator
>   4. If the copy is not a kill, copy the 4 VSRs associated with the 
> source back to the source accumulator
>
> So if these registers were supported as regular values, we would have 
> really expensive copies (and also expensive function calls and returns

I don't see why you call four vector moves or memory access expensive? 
What's the alternative? The programmer needs to move the data around 
somehow anyway if that's the thing that they need to do.
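
To make the cost discussion concrete, here is a minimal sketch in plain C 
of the four-step copy sequence quoted above. It is a software model only: 
the types `vsr_t` and `acc_model_t`, the function `acc_copy`, and the idea 
of counting 128-bit transfers are all assumptions for illustration, not 
the hardware behavior or anything in the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical software model, for illustration only. */
typedef struct { uint8_t bytes[16]; } vsr_t;   /* one 128-bit VSR */

typedef struct {
    vsr_t acc[4];   /* contents of the 512-bit accumulator          */
    vsr_t vsr[4];   /* the four VSRs associated with it             */
    int   primed;   /* nonzero: data lives in acc, vsr[] undefined  */
} acc_model_t;

/* Copy src into dst following the four quoted steps.
 * Returns the number of 128-bit transfers performed by this model. */
int acc_copy(acc_model_t *dst, acc_model_t *src, int src_is_killed)
{
    int moves = 0;
    assert(dst != src && src->primed);

    /* 1. Source accumulator back to its four associated VSRs. */
    memcpy(src->vsr, src->acc, sizeof src->acc);
    src->primed = 0;
    moves += 4;

    /* 2. Source VSRs to the VSRs associated with the destination. */
    memcpy(dst->vsr, src->vsr, sizeof src->vsr);
    moves += 4;

    /* 3. Destination VSRs into the destination accumulator. */
    memcpy(dst->acc, dst->vsr, sizeof dst->vsr);
    dst->primed = 1;
    moves += 4;

    /* 4. If the source is still live, move its VSRs back into it. */
    if (!src_is_killed) {
        memcpy(src->acc, src->vsr, sizeof src->vsr);
        src->primed = 1;
        moves += 4;
    }
    return moves;
}
```

In this model a copy that kills the source counts 12 transfers and one 
that keeps the source live counts 16; only step 2 corresponds to the 
"four vector moves" I mention above.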

> ) and we would prevent the use of 4 vector registers per live 
> accumulator. More importantly (something I should have mentioned in 
> the RFC), the new instructions actually implement a single operation 
> that is the outer product. That means that supporting these as regular 
> values would imply copying accumulators back to their associated VSRs 
> and generating non-MMA instructions for any other operation anyway. 
> Therefore, it is likely that programs using matrices would actually be 
> less efficient.

I'm assuming that you'll model the registers explicitly (using 
RegisterTuples or similar in TableGen), so you'll end up with a 
collection of registers that alias appropriately with their underlying 
VSRs, and the general infrastructure will handle the details of copying, 
killing, and so on. Is that correct?

If we add patterns for, say, adding, that use subregister extraction 
along with the underlying VSR instructions, then hopefully the 
infrastructure will coalesce away any unnecessary copies and we'll get 
the right "in place" matrix addition. To say that the types support only 
outer product is, based on my interpretation of your description, 
technically correct, but on the other hand, elementwise operations 
(e.g., add) can be directly supported using the underlying operations on 
the VSRs at reasonably-low cost. Is this correct?

I'm not sure exactly what our legalization framework does for matrix 
types currently, but presumably it should handle expansion for the other 
cases.
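
As a rough illustration of what I mean by an elementwise operation built 
from the underlying VSR operations, here is a sketch in plain C. It 
assumes a 4x4 float layout with one row per VSR; the types `vec4f` and 
`acc4x4f` and the functions `vec4f_add` and `acc_add_inplace` are 
hypothetical names for this example and say nothing about how the 
backend would actually lower an add.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: one 128-bit VSR holding four floats, and an
 * accumulator viewed as its four underlying VSRs (one per row). */
typedef struct { float lane[4]; } vec4f;
typedef struct { vec4f row[4]; } acc4x4f;

/* Models a single ordinary vector add on one VSR. */
static vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Elementwise matrix add expressed as four independent vector adds,
 * one per underlying register, updating dst in place. */
void acc_add_inplace(acc4x4f *dst, const acc4x4f *src)
{
    assert(dst != NULL && src != NULL);
    for (size_t r = 0; r < 4; ++r)
        dst->row[r] = vec4f_add(dst->row[r], src->row[r]);
}
```

The point of the sketch is only that the add decomposes into four 
independent per-register operations, with no step that needs the matrix 
unit itself.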

Regardless of what the frontend accepts, I would prefer to see, where 
possible, types modeled using generic LLVM types and operations.

>
> However, although we're not planning on supporting the 
> target-independent matrix types for these reasons, we're not excluding 
> supporting the target-independent matrix operations. We are exploring 
> implementing the target-independent matrix multiplication operation 
> with MMA kernels. That way, on PowerPC, programs using 
> target-independent matrix types and operations would actually benefit 
> from MMA for matrix multiplication with no additional effort.

That sounds good to me.
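
For readers following along, the reason an outer-product instruction is 
enough for matrix multiplication is that a product A*B is the sum of the 
outer products of the columns of A with the rows of B. A minimal sketch 
in plain C, assuming a 4x4 float tile and row-major inputs; the names 
`TILE`, `outer_product_acc` and `matmul_tile` are hypothetical and this 
is not the proposed MMA kernel.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 4 };  /* assumed tile size for this sketch */

/* Rank-1 update: acc += x * y^T. This is the one operation the quoted
 * text says the new instructions implement. */
void outer_product_acc(float acc[TILE][TILE],
                       const float x[TILE], const float y[TILE])
{
    for (size_t i = 0; i < TILE; ++i)
        for (size_t j = 0; j < TILE; ++j)
            acc[i][j] += x[i] * y[j];
}

/* c = a * b, where a is TILE x k and b is k x TILE, both row-major.
 * The product is accumulated as k outer products. */
void matmul_tile(float c[TILE][TILE],
                 const float *a, const float *b, size_t k)
{
    assert(c != NULL && a != NULL && b != NULL);

    for (size_t i = 0; i < TILE; ++i)
        for (size_t j = 0; j < TILE; ++j)
            c[i][j] = 0.0f;

    for (size_t p = 0; p < k; ++p) {
        float col[TILE];
        for (size_t i = 0; i < TILE; ++i)
            col[i] = a[i * k + p];           /* column p of a */
        outer_product_acc(c, col, &b[p * TILE]);  /* row p of b */
    }
}
```

The accumulator `c` is written once at the start and read once at the 
end, which matches the initialize, compute, disassemble workflow 
described in the RFC.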

Thanks again,

Hal

>
> Baptiste.
>
>
>>     We need to add a new target-dependent type and restrict its use.
>>     We give more details on these restrictions below. To be able to
>>     manipulate these matrices, we want to add the `__vector_quad` type
>>     to Clang. This type would be a PowerPC-specific builtin type mapped
>>     to the new 512-bit registers.
>
>
>     Okay.
>
>      -Hal
>
>
>>
>>     Similarly, some of these instructions take 256-bit values that must
>>     be stored in two consecutive VSX registers. To represent these
>>     values and minimize the number of copies between VSX registers, we
>>     also want to add the PowerPC-specific builtin type `__vector_pair`
>>     that would be mapped to consecutive VSX registers.
>>
>>     Value initialization
>>     --------------------
>>
>>     The only way to initialize a `__vector_pair` is by calling a builtin
>>     taking two 128-bit vectors and assembling them to form a 256-bit
>>     pair. A similar builtin exists to assemble four 128-bit vectors to
>>     form a 512-bit `__vector_quad`:
>>
>>     vector unsigned char v1 = ...;
>>     vector unsigned char v2 = ...;
>>     vector unsigned char v3 = ...;
>>     vector unsigned char v4 = ...;
>>     __vector_pair vp;
>>     __vector_quad vq;
>>     __builtin_mma_assemble_pair(&vp, v1, v2);
>>     __builtin_mma_assemble_acc(&vq, v1, v2, v3, v4);
>>
>>     The other way to initialize a `__vector_quad` is to call a builtin
>>     mapped to an instruction generating a new value of this type:
>>
>>     __vector_quad vq1;
>>     __builtin_mma_xxsetaccz(&vq1); // zero-initializes vq1
>>     __vector_quad vq2;
>>     __builtin_mma_xvi4ger8(&vq2, v1, v2); // new value generated in vq2
>>
>>     Both `__vector_pair` and `__vector_quad` can also be loaded from
>>     pointers that can potentially be cast from void or char pointers.
>>
>>     Value extraction
>>     ----------------
>>
>>     The only way to extract values from a matrix is to call the builtins
>>     disassembling `__vector_pair` and `__vector_quad` values back into
>>     two and four 128-bit vectors respectively:
>>
>>     vector unsigned char* vpr = ...;
>>     vector unsigned char* vqr = ...;
>>     __builtin_mma_disassemble_pair(vpr, &vp);
>>     __builtin_mma_disassemble_acc(vqr, &vq);
>>
>>     Once the values are disassembled to vectors, the user can extract
>>     values as usual, for example using the subscript operator on the
>>     vector unsigned char values. So the typical workflow to efficiently
>>     use these instructions in a kernel is to first initialize the
>>     matrices, then perform computations and finally disassemble them to
>>     extract the result of the computations. These three steps should be
>>     done using the provided builtins.
>>
>>     Semantics
>>     ---------
>>
>>     To enforce using values of these types in kernels, and thus to
>>     avoid copies from/to the matrix multiplication unit, we want to
>>     prevent as many implicit copies as possible. That means that it
>>     should only be possible to declare values of these types as local
>>     variables. We want to prevent any other way to declare and use
>>     non-pointer variables of these types (global variable, function
>>     parameter, function return, etc.).
>>
>>     The only situations in which these types and values of these types
>>     can be used are:
>>       * Local variable declaration
>>       * Assignment operator
>>       * Builtin call parameter
>>       * Memory allocation
>>       * Typedef & alias
>>
>>     Implementation
>>     --------------
>>
>>     We have implemented the support of these types, builtins and
>>     intrinsics in both Clang's frontend and the LLVM PowerPC backend. We
>>     will post the backend implementation later. We implemented and
>>     tested this support out-of-tree in conjunction with the GCC team to
>>     ensure a common API and ensure source compatibility. For this RFC,
>>     we have 5 patches for the frontend:
>>       * Add options to control MMA support on PowerPC targets [2].
>>       * Define the two new types as Clang target-dependent builtin types.
>>         As other targets do, we decided to define these types in a
>>         separate `PPCtypes.def` file to improve extensibility in case we
>>         need to add other PowerPC-specific types in the future [3].
>>       * Add the builtin definitions. These builtins use the two new
>>         types, so they use custom type descriptors. To avoid pervasive
>>         changes, we use custom decoding of these descriptors [4].
>>       * Add the Sema checks to restrict the use of the two types.
>>         We prevent the use of non-pointer values of these types in any
>>         declaration that is not a local variable declaration. We also
>>         prevent them from being passed as function arguments and from
>>         being returned from functions [5].
>>       * Implement the minimal required changes to LLVM to support the
>>         builtins. In this patch, we enable the use of v256i1 for
>>         intrinsic arguments and define all the MMA intrinsics the
>>         builtins are mapped to [6].
>>
>>     The backend implementation should not impact other targets. We do
>>     not plan to add any type to LLVM. `__vector_pair` and
>>     `__vector_quad` are generated as `v256i1` and `v512i1` respectively
>>     (both are currently unused in the PowerPC backend). VSX pair
>>     registers will be allocated to the `v256i1` type and the new
>>     accumulator registers will be allocated to the `v512i1` type.
>>
>>     [0] Power ISA v3.1,
>>         https://ibm.ent.box.com/s/hhjfw0x0lrbtyzmiaffnbxh2fuo0fog0
>>     [1] https://clang.llvm.org/docs/MatrixTypes.html
>>     [2] https://reviews.llvm.org/D81442
>>     [3] https://reviews.llvm.org/D81508
>>     [4] https://reviews.llvm.org/D81748
>>     [5] https://reviews.llvm.org/D82035
>>     [6] https://reviews.llvm.org/D81744
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>
