<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 6/19/20 9:31 PM, Baptiste Saleil via

      cfe-dev wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CA+JOuH1Pv8rO907gGFYa7uk0Ya4zwNGOxGe7gakx3r8ce1SJJg@mail.gmail.com">

      <div dir="ltr">Summary<br>

        -------<br>

        <br>

        New Power ISA v3.1 [0] introduces instructions to accelerate

        matrix<br>

        multiplication. We want to expose these instructions through a

        list of<br>

        target-dependent builtins and new Clang types in the form of a

        language<br>

        extension. This RFC gives more details on the requirements for

        these<br>

        types and explains how we (IBM) are implementing them in Clang.<br>

        <br>

        We present the frontend implementation as an RFC because we need

        to add<br>

        target-specific checks in Sema and want to get feedback on our

        implementation<br>

        of these checks. The backend implementation does not impact the

        other targets<br>

        so it is not part of this RFC. Comments and questions are

        welcome.<br>

        <br>

        Introduction<br>

        ------------<br>

        <br>

        The new instructions manipulate matrices that the CPU represents

        by new 512-bit<br>

        registers called `accumulators`. Copying matrices, modifying

        values and<br>

        extracting values of matrices may cause the CPU to copy values

        from/to the<br>

        matrix multiplication unit. To avoid degrading performance, we

        thus want to<br>

        minimize the number of times these operations are used. So the

        user will be able<br>

        to modify and extract values of the matrices and perform

        computations with them<br>

        by using the dedicated builtins only. The instructions are

        designed to be used in<br>

        computational kernels and we want to enforce that specific

        workflow.<br>

        <br>

        Because of this restriction, we cannot rely on the

        target-independent matrix<br>

        types [1].</div>

    </blockquote>

    <p><br>

    </p>

    <p>If this is part of the documented system ABI, and what will be

      supported by GCC, then we should support it too.</p>

    <p>That having been said, I'm not convinced that this is a good

      idea, and supporting the target-independent matrix types would be

      better. I understand that the copying will be expensive, and is

      something that should be avoided, but this is true to some extent

      for everything: there are some usages that compile to machine code

      efficiently and some that don't. We generally, however, favor the

      ability to create abstractions that *can* be compiled efficiently

      as part of expected use cases, even if we cannot guarantee that

      all uses will produce efficient code. In this case, you're

      prohibiting the creation of abstractions (by semantically

      restricting to local variables) because you fear that not all uses

      will compile to efficient code. Are there some other structural

      reasons why supporting these as regular values would be

      problematic?<br>

    </p>
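    <p>To make the concern concrete, here is a minimal sketch in plain
      C. It assumes a hypothetical stand-in struct, <code>acc_t</code>
      (a 4x4 float matrix, 512 bits), in place of the real
      <code>__vector_quad</code>, and every helper name in it is
      invented for illustration. The by-value helpers are the kind of
      abstraction that the proposed restrictions would reject; the
      pointer-based helper is the form that would remain legal.</p>

<pre>
```c
/* Hypothetical stand-in for a 512-bit accumulator: a 4x4 float matrix.
   This is NOT the real __vector_quad type; it only models its size. */
typedef struct { float m[4][4]; } acc_t;

/* By-value style: returns a fresh, zeroed accumulator. With the proposed
   Sema checks, this signature would be rejected for __vector_quad. */
static acc_t acc_zero(void) {
    acc_t r;
    for (int i = 0; i != 4; ++i)
        for (int j = 0; j != 4; ++j)
            r.m[i][j] = 0.0f;
    return r;
}

/* By-value rank-1 update: returns acc + a * b^T. Taking and returning the
   accumulator by value would also be rejected under the proposal. */
static acc_t acc_outer_add(acc_t acc, const float a[4], const float b[4]) {
    for (int i = 0; i != 4; ++i)
        for (int j = 0; j != 4; ++j)
            acc.m[i][j] += a[i] * b[j];
    return acc;
}

/* Pointer style: the accumulator is only ever passed by address, which is
   the form the proposed restrictions would still allow. */
static void acc_outer_add_inplace(acc_t *acc, const float a[4], const float b[4]) {
    for (int i = 0; i != 4; ++i)
        for (int j = 0; j != 4; ++j)
            acc->m[i][j] += a[i] * b[j];
}

/* Driver using the by-value helpers; returns element [3][2] of the result. */
static float demo_by_value(void) {
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {1.0f, 0.0f, 2.0f, 0.0f};
    acc_t acc = acc_outer_add(acc_zero(), a, b);
    return acc.m[3][2];
}

/* Driver using a local variable plus the pointer helper; same computation. */
static float demo_in_place(void) {
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {1.0f, 0.0f, 2.0f, 0.0f};
    acc_t acc = acc_zero();
    acc_outer_add_inplace(&acc, a, b);
    return acc.m[3][2];
}
```
</pre>

    <p>Both drivers compute the same rank-1 update. With regular value
      semantics, an optimizer is in principle free to keep the by-value
      version in registers once the helpers are inlined; whether that
      works out well for accumulators is the open question.</p>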

    <p><br>

    </p>

    <blockquote type="cite" cite="mid:CA+JOuH1Pv8rO907gGFYa7uk0Ya4zwNGOxGe7gakx3r8ce1SJJg@mail.gmail.com">

      <div dir="ltr"> We need to add a new target-dependent type and

        restrict its use.<br>

        We give more details on these restrictions below. To be able to

        manipulate<br>

        these matrices, we want to add the `__vector_quad` type to

        Clang. This type<br>

        would be a PowerPC-specific builtin type mapped to the new

        512-bit registers.<br>

      </div>

    </blockquote>

    <p><br>

    </p>

    <p>Okay.</p>

    <p> -Hal<br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite" cite="mid:CA+JOuH1Pv8rO907gGFYa7uk0Ya4zwNGOxGe7gakx3r8ce1SJJg@mail.gmail.com">

      <div dir="ltr"><br>

        Similarly, some of these instructions take 256-bit values that

        must be stored<br>

        in two consecutive VSX registers. To represent these values and

        minimize the<br>

        number of copies between VSX registers, we also want to add the

        PowerPC-specific<br>

        builtin type `__vector_pair` that would be mapped to consecutive

        VSX registers.<br>

        <br>

        Value initialization<br>

        --------------------<br>

        <br>

        The only way to initialize a `__vector_pair` is by calling a

        builtin taking two<br>

        128-bit vectors and assembling them to form a 256-bit pair. A

        similar builtin<br>

        exists to assemble four 128-bit vectors to form a 512-bit

        `__vector_quad`:<br>

        <br>

        vector unsigned char v1 = ...;<br>

        vector unsigned char v2 = ...;<br>

        vector unsigned char v3 = ...;<br>

        vector unsigned char v4 = ...;<br>

        __vector_pair vp;<br>

        __vector_quad vq;<br>

        __builtin_mma_assemble_pair(&vp, v1, v2);<br>

        __builtin_mma_assemble_acc(&vq, v1, v2, v3, v4);<br>

        <br>

        The other way to initialize a `__vector_quad` is to call a

        builtin mapped to an<br>

        instruction generating a new value of this type:<br>

        <br>

        __vector_quad vq1;<br>

        __builtin_mma_xxsetaccz(&vq1); // zero-initializes vq1<br>

        __vector_quad vq2;<br>

        __builtin_mma_xvi4ger8(&vq2, v1, v2); // new value generated

        in vq2<br>

        <br>

        Both `__vector_pair` and `__vector_quad` can also be loaded from

        pointers that<br>

        can potentially be cast from void or char pointers.<br>

        <br>

        Value extraction<br>

        ----------------<br>

        <br>

        The only way to extract values from a matrix is to call the

        builtins<br>

        disassembling `__vector_pair` and `__vector_quad` values back

        into two<br>

        and four 128-bit vectors respectively:<br>

        <br>

        vector unsigned char* vpr = ...;<br>

        vector unsigned char* vqr = ...;<br>

        __builtin_mma_disassemble_pair(vpr, &vp);<br>

        __builtin_mma_disassemble_acc(vqr, &vq);<br>

        <br>

        Once the values are disassembled to vectors, the user can

        extract values as<br>

        usual, for example using the subscript operator on the vector

        unsigned char<br>

        values. So the typical workflow to efficiently use these

        instructions in a<br>

        kernel is to first initialize the matrices, then perform

        computations and finally<br>

        disassemble them to extract the result of the computations.

        These three steps<br>

        should be done using the provided builtins.<br>

        <br>

        Semantics<br>

        ---------<br>

        <br>

        To enforce using values of these types in kernels, thus to avoid

        copies from/to<br>

        the matrix multiplication unit, we want to prevent as many

        implicit copies<br>

        as possible. That means that it should only be possible to

        declare values of<br>

        these types as local variables. We want to prevent any other way

        to declare and<br>

        use non-pointer variables of these types (global variable,

        function parameter,<br>

        function return, etc...).<br>

        <br>

        The only situations in which these types and values of these

        types can be<br>

        used are:<br>

          * Local variable declaration<br>

          * Assignment operator<br>

          * Builtin call parameter<br>

          * Memory allocation<br>

          * Typedef & alias<br>

        <br>

        Implementation<br>

        --------------<br>

        <br>

        We have implemented the support of these types, builtins and

        intrinsics in both<br>

        Clang's frontend and the LLVM PowerPC backend. We will post the

        backend<br>

        implementation later. We implemented and tested this support

        out-of-tree in<br>

        conjunction with the GCC team to ensure a common API and ensure

        source<br>

        compatibility. For this RFC, we have 5 patches for the frontend:<br>

          * Add options to control MMA support on PowerPC targets [2].<br>

          * Define the two new types as Clang target-dependent builtin

        types.<br>

            As with the other targets, we decided to define these types in a

        separate<br>

            `PPCtypes.def` file to improve extensibility in case we need

        to add other<br>

            PowerPC-specific types in the future [3].<br>

          * Add the builtin definitions. These builtins use the two new

        types,<br>

            so they use custom type descriptors. To avoid pervasive

        changes,<br>

            we use custom decoding of these descriptors [4].<br>

          * Add the Sema checks to restrict the use of the two types.<br>

            We prevent the use of non-pointer values of these types in

        any declaration<br>

            that is not a local variable declaration. We also prevent

        them from<br>

            being passed as function arguments and returned from

        functions [5].<br>

          * Implement the minimal required changes to LLVM to support

        the builtins.<br>

            In this patch, we enable the use of v256i1 for intrinsic

        arguments and<br>

            define all the MMA intrinsics the builtins are mapped to

        [6].<br>

        <br>

        The backend implementation should not impact other targets. We

        do not plan to<br>

        add any type to LLVM. `__vector_pair` and `__vector_quad` are

        generated as<br>

        `v256i1` and `v512i1` respectively (both are currently unused in

        the PowerPC<br>

        backend). VSX pair registers will be allocated to the `v256i1`

        type and the<br>

        new accumulator registers will be allocated to the `v512i1`

        type.<br>

      </div>

    </blockquote>

    <blockquote type="cite" cite="mid:CA+JOuH1Pv8rO907gGFYa7uk0Ya4zwNGOxGe7gakx3r8ce1SJJg@mail.gmail.com">

      <div dir="ltr"><br>

        [0] Power ISA v3.1, <a href="https://ibm.ent.box.com/s/hhjfw0x0lrbtyzmiaffnbxh2fuo0fog0" moz-do-not-send="true">https://ibm.ent.box.com/s/hhjfw0x0lrbtyzmiaffnbxh2fuo0fog0</a><br>

        [1] <a href="https://clang.llvm.org/docs/MatrixTypes.html" moz-do-not-send="true">https://clang.llvm.org/docs/MatrixTypes.html</a><br>

        [2] <a href="https://reviews.llvm.org/D81442" moz-do-not-send="true">https://reviews.llvm.org/D81442</a><br>

        [3] <a href="https://reviews.llvm.org/D81508" moz-do-not-send="true">https://reviews.llvm.org/D81508</a><br>

        [4] <a href="https://reviews.llvm.org/D81748" moz-do-not-send="true">https://reviews.llvm.org/D81748</a><br>

        [5] <a href="https://reviews.llvm.org/D82035" moz-do-not-send="true">https://reviews.llvm.org/D82035</a><br>

        [6] <a href="https://reviews.llvm.org/D81744" moz-do-not-send="true">https://reviews.llvm.org/D81744</a><br>

      </div>

      <br>


    </blockquote>

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </body>

</html>