<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/131954">131954</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AMDGPU] Large mfma16x16x16 register tiles plus software pipelining leads to bad register spills
        </td>
    </tr>

    <tr>
        <th>Labels</th>
        <td>
            new issue
        </td>
    </tr>

    <tr>
        <th>Assignees</th>
        <td>
        </td>
    </tr>

    <tr>
        <th>Reporter</th>
        <td>
            raiseirql
        </td>
    </tr>
</table>

<pre>
    When attempting to increase the block size used for a matmul on gfx942 to something exceeding 128x256, we (Modular) are hitting a problem where the inner loop generates many unnecessary vgpr/agpr moves. These moves degrade performance back to the level of 128x128.

In the example below, the code generates this at the top of the inner loop. The `v_accvgpr_read_b32` and `v_accvgpr_write_b32` instructions are unexpected.
```assembly
        v_accvgpr_write_b32 a41, 0
        v_accvgpr_write_b32 a40, 0
.LBB0_3:                                ; =>This Inner Loop Header: Depth=1
        v_accvgpr_read_b32 v33, a119
        v_accvgpr_read_b32 v37, a99
...
        v_accvgpr_read_b32 v50, a92
        s_waitcnt lgkmcnt(1)
        v_mfma_f32_16x16x16_bf16 a[92:95], v[26:27], v[14:15], a[112:115]
        v_accvgpr_read_b32 v56, a90
        v_accvgpr_write_b32 a107, v37
...
        v_accvgpr_read_b32 v39, a101
        v_accvgpr_read_b32 v38, a100
        v_mfma_f32_16x16x16_bf16 a[96:99], v[26:27], v[18:19], a[168:171]
        ds_read2_b64 v[30:33], v1 offset0:6 offset1:14
```

Removing the software pipelining intrinsics does remove these instructions, but at the cost of poorly scheduled code. I noticed that a composable kernel example was able to use a 256x256 register block here: https://github.com/ROCm/composable_kernel/blob/develop/example/01_gemm/gemm_xdl_fp16_v2.cpp. That test uses 32x32x8 MFMA instructions instead. Digging into the IR passes, I found that the reason this works is that the "tied" flag is set for these instructions, so the input accumulator is tied to the output accumulator.

https://github.com/llvm/llvm-project/blob/2e39533e5055bc82bf5a1b2a888e5d65a33e72f9/llvm/lib/Target/AMDGPU/VOP3PInstructions.td#L629
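
For comparison, here is a minimal sketch of the 32x32x8 path used by the composable kernel example, calling the existing `__builtin_amdgcn_mfma_f32_32x32x8f16` builtin (the wrapper function and type names are just for illustration):
```c++
typedef _Float16 f16x4_t __attribute__((ext_vector_type(4)));
typedef float f32x16_t __attribute__((ext_vector_type(16)));

__device__ f32x16_t mfma_32x32x8(f16x4_t a, f16x4_t b, f32x16_t acc)
{
    // With the "tied" flag set on these instructions, the returned
    // accumulator reuses the AGPR tuple of the acc operand, so no
    // v_accvgpr_read/write pairs are needed around the loop.
    return __builtin_amdgcn_mfma_f32_32x32x8f16(a, b, acc, 0, 0, 0);
}
```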

If I change the 16x16x16 intrinsic to unconditionally set `NoDstOverlap`, all of the above spill instructions are removed. One way to fix our issue would be to add a set of "tied" intrinsics mirroring what was done for `int_amdgcn_wmma_f16_16x16x16_f16_tied`.
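
As a source-level sketch, such a mirrored intrinsic might be used like this; the `__builtin_amdgcn_mfma_f32_16x16x16bf16_1k_tied` name below is hypothetical and does not exist in clang today:
```c++
typedef __bf16 bf16x4_t __attribute__((ext_vector_type(4)));
typedef float f32x4_t __attribute__((ext_vector_type(4)));

__device__ f32x4_t mfma_step(bf16x4_t a, bf16x4_t b, f32x4_t acc)
{
    // Hypothetical builtin: same signature as the existing
    // __builtin_amdgcn_mfma_f32_16x16x16bf16_1k, but with the destination
    // tied to the accumulator operand so no accvgpr copies are inserted.
    return __builtin_amdgcn_mfma_f32_16x16x16bf16_1k_tied(a, b, acc, 0, 0, 0);
}
```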

Using inline assembly would have the same "tied" effect, but we also need the MFMA instructions to be detected as `SchedGroupMask::MFMA` for the software pipelining to work.
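
For illustration, a minimal sketch of that workaround, assuming the AMDGPU `"a"` (AGPR) and `"v"` (VGPR) inline-asm constraints and reusing the gfx942 mnemonic from the dump above:
```c++
typedef __bf16 bf16x4_t __attribute__((ext_vector_type(4)));
typedef float f32x4_t __attribute__((ext_vector_type(4)));

__device__ void mfma_bf16_tied(bf16x4_t a, bf16x4_t b, f32x4_t& acc)
{
    // "+a" keeps the accumulator in a single AGPR tuple for both input and
    // output, which has the same effect as a tied dst/src2 constraint. The
    // opaque asm, however, is not recognized as an MFMA by the scheduling
    // group machinery.
    asm("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
        : "+a"(acc)
        : "v"(a), "v"(b));
}
```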

Here is a reduced kernel that reproduces the problem with clang. Building it at commit `df790008` with the above hack to force tied behavior generates the expected code.
```c++
// /opt/rocm/bin/amdclang++ -x hip --offload-device-only --offload-arch=gfx942 -O3 -S repro.cpp -o repro.s
#include "hip/hip_runtime.h"

typedef __bf16 float16x4_t __attribute__((ext_vector_type(4)));
typedef float float32x4_t __attribute__((ext_vector_type(4)));

__global__ void __launch_bounds__(256) kernel_16x16x16f16(float16x4_t* a, float16x4_t* b, float32x4_t* c)
{
    static constexpr int num_rows = 7;
    static constexpr int num_cols = 8;
    static constexpr int num_accums = num_rows * num_cols;

    __shared__ float16x4_t a_shared[num_rows];
    __shared__ float16x4_t b_shared[num_cols];

    float16x4_t a_vals[num_rows];
    float16x4_t b_vals[num_cols];
    float32x4_t c_accum[num_accums];

    if (threadIdx.x == 0) {
        #pragma unroll
        for (int m = 0; m < num_rows; m++)
            a_shared[m] = a[m];
        #pragma unroll
        for (int n = 0; n < num_cols; n++)
            b_shared[n] = b[n];
    }
    __syncthreads();

    #pragma unroll
    for (int m = 0; m < num_rows; m++)
        a_vals[m] = a_shared[m];
    #pragma unroll
    for (int n = 0; n < num_cols; n++)
        b_vals[n] = b_shared[n];

    #pragma unroll
    for (int i = 0; i < num_accums; i++)
        c_accum[i] = float32x4_t(0);

    __builtin_amdgcn_sched_barrier(0);

    for (int k = 0; k < 64; k++) {
        #pragma unroll
        for (int m = 0; m < num_rows; m++) {
            #pragma unroll
            for (int n = 0; n < num_cols; n++) {
                // Index with num_rows so the 7x8 accumulator tile stays in
                // bounds (n * num_cols + m would run past num_accums).
                c_accum[n * num_rows + m] = __builtin_amdgcn_mfma_f32_16x16x16bf16_1k(
                    a_vals[m], b_vals[n], c_accum[n * num_rows + m], 0, 0, 0);
            }
        }

        #pragma unroll
        for (int m = 0; m < num_rows; m++)
            a_vals[m] = a_shared[m];
        #pragma unroll
        for (int n = 0; n < num_cols; n++)
            b_vals[n] = b_shared[n];

        __builtin_amdgcn_sched_group_barrier(0x008, 8, 0); // MFMA
        __builtin_amdgcn_sched_group_barrier(0x100, 1, 0); // DS_READ
    }

    __builtin_amdgcn_sched_barrier(0);

    #pragma unroll
    for (int i = 0; i < num_accums; i++) {
        c[threadIdx.x] = c_accum[i];
        c += 256;
    }
}
```
</pre>