[cfe-dev] RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Fri Jun 19 19:31:25 PDT 2020

Summary
-------

The new Power ISA v3.1 [0] introduces instructions to accelerate matrix
multiplication. We want to expose these instructions through a list of
target-dependent builtins and new Clang types in the form of a language
extension. This RFC gives more details on the requirements for these
types and explains how we (IBM) are implementing them in Clang.

We present the frontend implementation as an RFC because we need to add
target-specific checks in Sema and want feedback on our implementation of
these checks. The backend implementation does not affect other targets, so
it is not part of this RFC. Comments and questions are welcome.

Introduction
------------

The new instructions manipulate matrices that the CPU represents with new
512-bit registers called accumulators. Copying matrices, modifying their
values, and extracting their values may cause the CPU to copy data to or
from the matrix multiplication unit. To avoid degrading performance, we
want to minimize the number of times these operations are used. Users will
therefore be able to modify matrices, extract their values, and perform
computations with them only through the dedicated builtins. The
instructions are designed to be used in computational kernels and we want
to enforce that specific workflow.

Because of this restriction, we cannot rely on the target-independent matrix
types [1]. We need to add a new target-dependent type and restrict its use.
We give more details on these restrictions below. To be able to manipulate
these matrices, we want to add the `__vector_quad` type to Clang. This type
would be a PowerPC-specific builtin type mapped to the new 512-bit
registers.

Similarly, some of these instructions take 256-bit values that must be
stored in two consecutive VSX registers. To represent these values and
minimize the number of copies between VSX registers, we also want to add
the PowerPC-specific builtin type `__vector_pair`, which would be mapped to
consecutive VSX registers.

Value initialization
--------------------

A `__vector_pair` can be initialized by calling a builtin that takes two
128-bit vectors and assembles them into a 256-bit pair. A similar builtin
assembles four 128-bit vectors into a 512-bit `__vector_quad`:

vector unsigned char v1 = ...;
vector unsigned char v2 = ...;
vector unsigned char v3 = ...;
vector unsigned char v4 = ...;
__vector_pair vp;
__vector_quad vq;
__builtin_mma_assemble_pair(&vp, v1, v2);
__builtin_mma_assemble_acc(&vq, v1, v2, v3, v4);

The other way to initialize a `__vector_quad` is to call a builtin mapped
to an instruction that generates a new value of this type:

__vector_quad vq1;
__builtin_mma_xxsetaccz(&vq1); // zero-initializes vq1
__vector_quad vq2;
__builtin_mma_xvi4ger8(&vq2, v1, v2); // new value generated in vq2

Both `__vector_pair` and `__vector_quad` values can also be loaded through
pointers, which may themselves be cast from void or char pointers.

Value extraction
----------------

The only way to extract values from a matrix is to call the builtins
disassembling `__vector_pair` and `__vector_quad` values back into two
and four 128-bit vectors respectively:

vector unsigned char* vpr = ...;
vector unsigned char* vqr = ...;
__builtin_mma_disassemble_pair(vpr, &vp);
__builtin_mma_disassemble_acc(vqr, &vq);

Once the values are disassembled to vectors, the user can extract values as
usual, for example using the subscript operator on the vector unsigned char
values. So the typical workflow to efficiently use these instructions in a
kernel is to first initialize the matrices, then perform computations and
finally disassemble them to extract the result of the computations. These
three steps should be done using the provided builtins.
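
The three steps can be illustrated with a small portable C model. This is
only a sketch: `model_acc`, `model_zero`, `model_outer_product_add`,
`model_disassemble` and `model_kernel` are hypothetical names that do not
exist in Clang or GCC, and the 4x4 float layout is an assumption made for
illustration. The model does not use the MMA builtins, which need a Power
ISA v3.1 target; it only mirrors the initialize / compute / disassemble
structure that a kernel would follow.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a 512-bit accumulator: a 4x4 matrix of floats.
   This is NOT the real __vector_quad type. */
typedef struct { float m[4][4]; } model_acc;

/* Step 1: initialize. Plays the role of a zeroing builtin. */
static void model_zero(model_acc *acc) {
    memset(acc, 0, sizeof *acc);
}

/* Step 2: compute. Adds the outer product of two 4-element vectors
   to the accumulator (a rank-1 update). */
static void model_outer_product_add(model_acc *acc,
                                    const float *a, const float *b) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            acc->m[i][j] += a[i] * b[j];
}

/* Step 3: disassemble. Copies the four rows out as 16 plain floats. */
static void model_disassemble(float *out, const model_acc *acc) {
    memcpy(out, acc->m, sizeof acc->m);
}

/* Kernel: out (4x4, row-major) = A * B, where `a` holds the k columns of A
   and `b` holds the k rows of B, each stored as 4 contiguous floats. */
void model_kernel(float *out, const float *a, const float *b, int k) {
    assert(k >= 0);
    model_acc acc;               /* the accumulator stays local */
    model_zero(&acc);
    for (int p = 0; p < k; ++p)
        model_outer_product_add(&acc, a + 4 * p, b + 4 * p);
    model_disassemble(out, &acc);
}
```

In a real kernel, each model function would be replaced by the
corresponding `__builtin_mma_*` call, and the accumulator would be a local
`__vector_quad` that is only touched through those builtins.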

Semantics
---------

To enforce the use of these types in kernels, and thus to avoid copies to
and from the matrix multiplication unit, we want to prevent as many
implicit copies as possible. This means that it should only be possible to
declare values of these types as local variables. We want to prevent any
other way of declaring and using non-pointer variables of these types
(global variables, function parameters, function return values, etc.).

The only situations in which these types and values of these types can be
used are:
  * Local variable declaration
  * Assignment operator
  * Builtin call parameter
  * Memory allocation
  * Typedef & alias
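
As a rough illustration of these rules, the sketch below shows uses that
the list above would allow, with the rejected forms left as comments. The
names `quad_t`, `copy_quad` and `quad_equal` are hypothetical. The sketch
assumes that the compiler defines `__MMA__` when MMA support is enabled;
on any other target it falls back to a 64-byte stand-in struct so that the
code still compiles. The exact diagnostics are defined by the Sema patch,
not by this example.

```c
#include <assert.h>
#include <string.h>

#if defined(__MMA__)              /* assumption: set when MMA is enabled */
typedef __vector_quad quad_t;     /* typedef & alias: allowed */
#else
/* Stand-in with the same size, only so the sketch builds elsewhere. */
typedef struct { unsigned char bytes[64]; } quad_t;
static_assert(sizeof(quad_t) == 64, "stand-in must be 512 bits");
#endif

/* Forms the proposed checks are meant to reject (comments only):
 *   quad_t global_acc;             global variable
 *   quad_t make_acc(void);         function return value
 *   void   use_acc(quad_t acc);    by-value function parameter
 */

/* Pointers may cross function boundaries; values may not. */
static void copy_quad(quad_t *dst, quad_t *src) {
    quad_t local = *src;   /* local variable, loaded through a pointer */
    *dst = local;          /* assignment operator */
}

/* Byte-wise comparison through pointers, for checking the copy. */
static int quad_equal(const quad_t *a, const quad_t *b) {
    return memcmp(a, b, sizeof *a) == 0;
}
```

Callers would pass `quad_t *` arguments and keep any non-pointer `quad_t`
confined to the body of the kernel function.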

Implementation
--------------

We have implemented support for these types, builtins and intrinsics in
both Clang's frontend and the LLVM PowerPC backend. We will post the
backend implementation later. We implemented and tested this support
out-of-tree in conjunction with the GCC team to ensure a common API and
source compatibility. For this RFC, we have 5 patches for the frontend:
  * Add options to control MMA support on PowerPC targets [2].
  * Define the two new types as Clang target-dependent builtin types.
    Like other targets, we decided to define these types in a separate
    `PPCtypes.def` file to improve extensibility in case we need to add
    other PowerPC-specific types in the future [3].
  * Add the builtin definitions. These builtins use the two new types,
    so they use custom type descriptors. To avoid pervasive changes,
    we use custom decoding of these descriptors [4].
  * Add the Sema checks to restrict the use of the two types.
    We prevent the use of non-pointer values of these types in any
    declaration that is not a local variable declaration. We also prevent
    them from being passed as function arguments and from being returned
    from functions [5].
  * Implement the minimal required changes to LLVM to support the builtins.
    In this patch, we enable the use of v256i1 for intrinsic arguments and
    define all the MMA intrinsics the builtins are mapped to [6].

The backend implementation should not affect other targets. We do not
plan to add any type to LLVM. `__vector_pair` and `__vector_quad` are
generated as `v256i1` and `v512i1` respectively (both are currently unused
in the PowerPC backend). VSX pair registers will be allocated to the
`v256i1` type and the new accumulator registers will be allocated to the
`v512i1` type.

[0] Power ISA v3.1, https://ibm.ent.box.com/s/hhjfw0x0lrbtyzmiaffnbxh2fuo0fog0
[1] https://clang.llvm.org/docs/MatrixTypes.html
[2] https://reviews.llvm.org/D81442
[3] https://reviews.llvm.org/D81508
[4] https://reviews.llvm.org/D81748
[5] https://reviews.llvm.org/D82035
[6] https://reviews.llvm.org/D81744