<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 6/19/20 9:31 PM, Baptiste Saleil via
cfe-dev wrote:<br>
</div>
<blockquote type="cite" cite="mid:CA+JOuH1Pv8rO907gGFYa7uk0Ya4zwNGOxGe7gakx3r8ce1SJJg@mail.gmail.com">
<div dir="ltr">Summary<br>
-------<br>
<br>
New Power ISA v3.1 [0] introduces instructions to accelerate
matrix<br>
multiplication. We want to expose these instructions through a
list of<br>
target-dependent builtins and new Clang types in the form of a
language<br>
extension. This RFC gives more details on the requirements for
these<br>
types and explains how we (IBM) are implementing them in Clang.<br>
<br>
We present the frontend implementation as an RFC because we need
to add<br>
target-specific checks in Sema and want to get feedback on our
implementation<br>
of these checks. The backend implementation does not impact the
other targets<br>
so it is not part of this RFC. Comments and questions are
welcome.<br>
<br>
Introduction<br>
------------<br>
<br>
The new instructions manipulate matrices that the CPU represents
by new 512-bit<br>
registers called `accumulators`. Copying matrices, modifying
values and<br>
extracting values of matrices may cause the CPU to copy values
from/to the<br>
matrix multiplication unit. To avoid degrading performance, we
thus want to<br>
minimize the number of times these operations are used. So the
user will be able<br>
to modify and extract values of the matrices and perform
computations with them<br>
by using the dedicated builtins only. The instructions are
designed to be used in<br>
computational kernels and we want to enforce that specific
workflow.<br>
<br>
Because of this restriction, we cannot rely on the
target-independent matrix<br>
types [1].</div>
</blockquote>
<p><br>
</p>
<p>If this is part of the documented system ABI, and what will be
supported by GCC, then we should support it too.</p>
<p>That having been said, I'm not convinced that this is a good
  idea; I think supporting the target-independent matrix types would be
  better. I understand that the copying will be expensive and is
  something that should be avoided, but this is true to some extent
  for everything: some usages compile to efficient machine code
  and some don't. We generally, however, favor the
  ability to create abstractions that *can* be compiled efficiently
  as part of expected use cases, even if we cannot guarantee that
  all uses will produce efficient code. In this case, you're
  prohibiting the creation of abstractions (by semantically
  restricting to local variables) because you fear that not all uses
  will compile to efficient code. Are there other structural
  reasons why supporting these as regular values would be
  problematic?<br>
</p>
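<p>To make the concern concrete, here is a rough sketch in portable C.
  It is a model only: `acc_model`, `acc_scaled` and `acc_scale_in_place`
  are hypothetical names, and `acc_model` is a plain struct standing in
  for a 512-bit accumulator, not `__vector_quad` itself, so no MMA
  builtin is used. Going by the restrictions described in the RFC, the
  by-value helper is the shape that would be rejected, while the
  pointer-based one would presumably still be accepted.</p>
<pre>
```c
/* Portable model only: acc_model is a hypothetical stand-in for a
   512-bit accumulator (16 floats). It is NOT __vector_quad. */
typedef struct {
    float v[16];
} acc_model;

/* By-value helper: value parameter and value return. This is the kind
   of abstraction the proposed Sema checks would reject if acc_model
   were __vector_quad. */
acc_model acc_scaled(acc_model a, float s)
{
    for (int i = 0; i < 16; ++i)
        a.v[i] *= s;
    return a; /* the caller's original value is left untouched */
}

/* Pointer-based helper: the proposed rules restrict non-pointer values
   only, so this shape would presumably still be allowed. */
void acc_scale_in_place(acc_model *a, float s)
{
    for (int i = 0; i < 16; ++i)
        a->v[i] *= s;
}
```
</pre>
<p>Whether the by-value form turns into good code would depend on
  inlining and on how copies are lowered; the point is only that the
  language-level restriction rules it out before the optimizer gets a
  chance.</p>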
<p><br>
</p>
<blockquote type="cite" cite="mid:CA+JOuH1Pv8rO907gGFYa7uk0Ya4zwNGOxGe7gakx3r8ce1SJJg@mail.gmail.com">
<div dir="ltr"> We need to add a new target-dependent type and
restrict its use.<br>
We give more details on these restrictions below. To be able to
manipulate<br>
these matrices, we want to add the `__vector_quad` type to
Clang. This type<br>
would be a PowerPC-specific builtin type mapped to the new
512-bit registers.<br>
</div>
</blockquote>
<p><br>
</p>
<p>Okay.</p>
<p> -Hal<br>
</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:CA+JOuH1Pv8rO907gGFYa7uk0Ya4zwNGOxGe7gakx3r8ce1SJJg@mail.gmail.com">
<div dir="ltr"><br>
Similarly, some of these instructions take 256-bit values that
must be stored<br>
in two consecutive VSX registers. To represent these values and
minimize the<br>
number of copies between VSX registers, we also want to add the
PowerPC-specific<br>
builtin type `__vector_pair` that would be mapped to consecutive
VSX registers.<br>
<br>
Value initialization<br>
--------------------<br>
<br>
The only way to initialize a `__vector_pair` is by calling a
builtin taking two<br>
128-bit vectors and assembling them to form a 256-bit pair. A
similar builtin<br>
exists to assemble four 128-bit vectors to form a 512-bit
`__vector_quad`:<br>
<br>
vector unsigned char v1 = ...;<br>
vector unsigned char v2 = ...;<br>
vector unsigned char v3 = ...;<br>
vector unsigned char v4 = ...;<br>
__vector_pair vp;<br>
__vector_quad vq;<br>
__builtin_mma_assemble_pair(&vp, v1, v2);<br>
__builtin_mma_assemble_acc(&vq, v1, v2, v3, v4);<br>
<br>
The other way to initialize a `__vector_quad` is to call a
builtin mapped to an<br>
instruction generating a new value of this type:<br>
<br>
__vector_quad vq1;<br>
__builtin_mma_xxsetaccz(&vq1); // zero-initializes vq1<br>
__vector_quad vq2;<br>
__builtin_mma_xvi4ger8(&vq2, v1, v2); // new value generated
in vq2<br>
<br>
Both `__vector_pair` and `__vector_quad` can also be loaded through
pointers,<br>
which may themselves have been cast from void or char pointers.<br>
<br>
Value extraction<br>
----------------<br>
<br>
The only way to extract values from a matrix is to call the
builtins<br>
disassembling `__vector_pair` and `__vector_quad` values back
into two<br>
and four 128-bit vectors respectively:<br>
<br>
vector unsigned char* vpr = ...;<br>
vector unsigned char* vqr = ...;<br>
__builtin_mma_disassemble_pair(vpr, &vp);<br>
__builtin_mma_disassemble_acc(vqr, &vq);<br>
<br>
Once the values are disassembled to vectors, the user can
extract values as<br>
usual, for example using the subscript operator on the vector
unsigned char<br>
values. So the typical workflow to efficiently use these
instructions in a<br>
kernel is to first initialize the matrices, then perform
computations and finally<br>
disassemble them to extract the result of the computations.
These three steps<br>
should be done using the provided builtins.<br>
<br>
Semantics<br>
---------<br>
<br>
To enforce using values of these types in kernels, thus to avoid
copies from/to<br>
the matrix multiplication unit, we want to prevent as many
implicit copies<br>
as possible. That means that it should only be possible to
declare values of<br>
these types as local variables. We want to prevent any other way
to declare and<br>
use non-pointer variables of these types (global variable,
function parameter,<br>
function return, etc...).<br>
<br>
The only situations in which these types and values of these
types can be<br>
used are:<br>
* Local variable declaration<br>
* Assignment operator<br>
* Builtin call parameter<br>
* Memory allocation<br>
* Typedef & alias<br>
<br>
Implementation<br>
--------------<br>
<br>
We have implemented the support of these types, builtins and
intrinsics in both<br>
Clang's frontend and the LLVM PowerPC backend. We will post the
backend<br>
implementation later. We implemented and tested this support
out-of-tree in<br>
conjunction with the GCC team to ensure a common API and ensure
source<br>
compatibility. For this RFC, we have 5 patches for the frontend:<br>
* Add options to control MMA support on PowerPC targets [2].<br>
* Define the two new types as Clang target-dependent builtin
types.<br>
As other targets do, we define these types in a
separate<br>
`PPCtypes.def` file to improve extensibility in case we need
to add other<br>
PowerPC-specific types in the future [3].<br>
* Add the builtin definitions. These builtins use the two new
types,<br>
so they use custom type descriptors. To avoid pervasive
changes,<br>
we use custom decoding of these descriptors [4].<br>
* Add the Sema checks to restrict the use of the two types.<br>
We prevent the use of non-pointer values of these types in
any declaration<br>
that is not a local variable declaration. We also prevent
them from<br>
being passed as function arguments and from being returned from
functions [5].<br>
* Implement the minimal required changes to LLVM to support
the builtins.<br>
In this patch, we enable the use of v256i1 for intrinsic
arguments and<br>
define all the MMA intrinsics the builtins are mapped to
[6].<br>
<br>
The backend implementation should not impact other targets. We
do not plan to<br>
add any type to LLVM. `__vector_pair` and `__vector_quad` are
generated as<br>
`v256i1` and `v512i1` respectively (both are currently unused in
the PowerPC<br>
backend). VSX pair registers will be allocated to the `v256i1`
type and the<br>
new accumulator registers will be allocated to the `v512i1`
type.<br>
</div>
</blockquote>
<blockquote type="cite" cite="mid:CA+JOuH1Pv8rO907gGFYa7uk0Ya4zwNGOxGe7gakx3r8ce1SJJg@mail.gmail.com">
<div dir="ltr"><br>
[0] Power ISA v3.1, <a href="https://ibm.ent.box.com/s/hhjfw0x0lrbtyzmiaffnbxh2fuo0fog0" moz-do-not-send="true">https://ibm.ent.box.com/s/hhjfw0x0lrbtyzmiaffnbxh2fuo0fog0</a><br>
[1] <a href="https://clang.llvm.org/docs/MatrixTypes.html" moz-do-not-send="true">https://clang.llvm.org/docs/MatrixTypes.html</a><br>
[2] <a href="https://reviews.llvm.org/D81442" moz-do-not-send="true">https://reviews.llvm.org/D81442</a><br>
[3] <a href="https://reviews.llvm.org/D81508" moz-do-not-send="true">https://reviews.llvm.org/D81508</a><br>
[4] <a href="https://reviews.llvm.org/D81748" moz-do-not-send="true">https://reviews.llvm.org/D81748</a><br>
[5] <a href="https://reviews.llvm.org/D82035" moz-do-not-send="true">https://reviews.llvm.org/D82035</a><br>
[6] <a href="https://reviews.llvm.org/D81744" moz-do-not-send="true">https://reviews.llvm.org/D81744</a><br>
</div>
<br>
</blockquote>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>