[cfe-dev] RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Baptiste Saleil via cfe-dev cfe-dev at lists.llvm.org
Fri Jul 24 08:48:28 PDT 2020


Hi,

On Tue, 21 Jul 2020 at 10:06, Florian Hahn <florian_hahn at apple.com> wrote:

> Hi,
>
> Sorry for jumping in a bit late, I missed the initial discussion.
>
> On Jul 21, 2020, at 05:13, Baptiste Saleil via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
> On Mon, 22 Jun 2020 at 19:01, Hal Finkel <hfinkel at anl.gov> wrote:
>
>> On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:
>>
>> The new instructions manipulate matrices that the CPU represents by new
>> 512-bit
>> registers called `accumulators`. Copying matrices, modifying values and
>> extracting values of matrices may cause the CPU to copy values from/to the
>> matrix multiplication unit. To avoid degrading performance, we thus want
>> to
>> minimize the number of times these operations are used. So the user will
>> be able
>> to modify and extract values of the matrices and perform computations
>> with them
>> by using the dedicated builtins only. The instructions are designed to be
>> used in
>> computational kernels and we want to enforce that specific workflow.
>>
>> Because of this restriction, we cannot rely on the target-independent
>> matrix
>> types [1].
>>
>>
>> If this is part of the documented system ABI, and what will be supported
>> by GCC, then we should support it too.
>>
>> That having been said, I'm not convinced that this is a good idea, and
>> supporting the target-independent matrix types would be better. I
>> understand that the copying will be expensive, and is something that should
>> be avoided, but this is true to some extent for everything: there are some
>> usages that compile to machine code efficiently and some that don't. We
>> generally, however, favor the ability to create abstractions that *can* be
>> compiled efficiently as part of expected use cases, even if we cannot
>> guarantee that all uses will produce efficient code. In this case, you're
>> prohibiting the creation of abstractions (by semantically restricting to
>> local variables) because you fear that not all uses will compile to
>> efficient code. Are there some other structural reasons why supporting
>> these as regular values would be problematic?
>>
> Supporting these as regular values would be problematic for several
> reasons. These new accumulator registers are actually each associated with
> 4 of the existing 128-bit VSR vector registers. A particularity of MMA is
> that when an accumulator contains defined data, its 4 associated registers
> contain undefined data and cannot be used. When copying an accumulator, we
> need to:
>   1. Copy its value back to its four associated VSRs
>   2. Copy these 4 VSRs to the VSRs associated with the destination
> accumulator
>   3. Copy these VSRs to the destination accumulator
>   4. If the copy is not a kill, copy the 4 VSRs associated with the source
> back to the source accumulator
>
> So if these registers were supported as regular values, we would have
> really expensive copies (and also expensive function calls and returns) and
> we would lose the use of 4 vector registers per live accumulator. More
> importantly (something I should have mentioned in the RFC), the new
> instructions actually implement a single operation that is the outer
> product. That means that supporting these as regular values would imply
> copying accumulators back to their associated VSRs and generating non-MMA
> instructions for any other operation anyway. Therefore, it is likely that
> programs using matrices would actually be less efficient.
>
>
> From the reasoning above, it sounds like there seem to be no structural
> reasons that prevent using the matrix type extension, unless I am missing
> something.
>
> But if I understand correctly, the main motivation for introducing the new
> types with the additional restrictions is to prevent users from writing
> code that cannot be mapped directly to the hardware?
>
> In particular, is a concern with the matrix types extension that a user
> could specify a matrix operation that cannot be mapped directly to the MMA
> extension, e.g. a multiply of 13 x 7 float matrices?
> And specify costly accesses, for example repeated access to elements that
> live in different VSR registers?
>

Not really. The problem is that MMA allows storing matrices in the new
registers but actually supports no operation on these matrices.
The only two things it can do are: compute the outer product of two vectors
and store the result as a matrix, or compute the outer product of two
vectors and add the result to a given matrix (no copy, no addition, no
element access, no transpose, etc.).
So if we use the new MMA registers for the matrix extension, *all* the
operations except matrix multiplication would actually be slower than
without MMA.
For example, accessing a single element of a matrix would imply copying the
accumulator register to its four VSR registers, extracting the element from
one of the VSRs, then copying the four VSRs back to the accumulator (whereas
without MMA we just extract the element from a VSR).
Similarly for binary operations, we would need to copy the accumulator to
its VSRs, do the operation on the VSRs, and copy them back to the
accumulator.
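To make the outer-product semantics concrete, here is a portable-C model of the two rank-1 update forms MMA provides (set an accumulator to an outer product, or add an outer product into it); this is only an illustration of what the hardware computes, not how the instructions are emitted:

```c
#include <assert.h>

/* Model of the MMA rank-1 update on a 4x4 float accumulator:
   acc[i][j] += x[i] * y[j]. The "set" form is the same with the
   accumulator zeroed first. These are the only computational
   operations MMA offers; everything else needs accumulator<->VSR
   copies. */
static void outer_product_acc(float acc[4][4],
                              const float x[4], const float y[4]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i][j] += x[i] * y[j];
}
```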

That's the reason why we think it is better to support the matrix extension
through the llvm.matrix.multiply intrinsic only. That way, the
multiplication is accelerated without the need to write target-dependent
code, and there is no negative impact on the other operations.
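For reference, what llvm.matrix.multiply computes is just an ordinary matrix product; a naive portable-C model of its semantics (row-major here purely for illustration, independently of how an MMA kernel would implement it) is:

```c
#include <assert.h>

/* Reference semantics of a matrix multiply C = A * B,
   where A is m x k, B is k x n, C is m x n, all row-major.
   An MMA-backed lowering would compute the same values using
   outer-product accumulation over tiles. */
static void matmul(int m, int k, int n,
                   const float *a, const float *b, float *c) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            for (int p = 0; p < k; p++)
                s += a[i * k + p] * b[p * n + j];
            c[i * n + j] = s;
        }
}
```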

The motivation for adding target-dependent types is that users who want to
explicitly generate code to accelerate matrix multiplication on PowerPC
(typically linear algebra library developers) can do so without the need to
write inline assembly. And the additional restrictions are added to help
them write these kernels with optimal performance, e.g. by preventing
copies.
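As a rough sketch of the kernel style this enables, here is what a small outer-product loop could look like with the MMA builtins under discussion (this requires a Power10 toolchain, e.g. -mcpu=power10; take the exact names and signatures as illustrative of the proposed interface rather than final):

```c
/* Sketch only; needs a Power10 compiler with MMA support. */
#include <altivec.h>

void kernel_4x4(const float *x, const float *y, float out[4][4], int k)
{
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);              /* zero the accumulator */
    for (int p = 0; p < k; p++) {
        vector unsigned char vx = *(const vector unsigned char *)&x[4 * p];
        vector unsigned char vy = *(const vector unsigned char *)&y[4 * p];
        __builtin_mma_xvf32gerpp(&acc, vx, vy); /* acc += outer(x, y) */
    }
    /* The only way to read the result: spill the accumulator
       back to its four associated VSRs. */
    __builtin_mma_disassemble_acc(out, &acc);
}
```

Note how the accumulator stays live across the whole loop and is disassembled exactly once at the end, which is the copy-avoiding pattern the restrictions are meant to enforce.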

>
> However, although we're not planning on supporting the target-independent
> matrix types for these reasons, we're not excluding supporting the
> target-independent matrix operations. We are exploring implementing the
> target-independent matrix multiplication operation with MMA kernels. That
> way, on PowerPC, programs using target-independent matrix types and
> operations would actually benefit from MMA for matrix multiplication with
> no additional effort.
>
>
> We recently started working on providing some infrastructure to allow for
> target-specific lowering upstream. Any collaboration on that front would be
> very welcome, to make sure things are general enough to support many
> different hardware extensions.
>

Thanks, we'll take a look at that.

Baptiste.

>
> Cheers,
> Florian
>