<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/64816>64816</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            [OpenMP] amdgpu bad choice of max_flat_workgroup_size

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          ye-luo

      </td>

    </tr>

</table>

<pre>

Currently clang always sets `max_flat_workgroup_size` to 1024, which causes register spilling:

```

    .max_flat_workgroup_size: 1024
    .name: __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
    .private_segment_fixed_size: 264
    .sgpr_count:     60
    .sgpr_spill_count: 0
    .symbol: __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
    .vgpr_count:     128
    .vgpr_spill_count: 66
    .wavefront_size: 64

```

I tested overriding the default using `ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))`

and got a 2x kernel speed-up, with no spills:

```

    .max_flat_workgroup_size: 256
    .name: __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
    .private_segment_fixed_size: 0
    .sgpr_count:     58
    .sgpr_spill_count: 0
    .symbol: __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
    .vgpr_count:     166
    .vgpr_spill_count: 0
    .wavefront_size: 64

```
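
For reference, the override can be attached to a target region roughly as sketched below. This is a minimal illustration, not the QMCPACK kernel from the listings: `scale_add` and its arguments are hypothetical, and the sketch assumes a clang new enough to accept the `ompx_attribute` clause on this combined directive (version 17 or later is assumed here). The pragma is guarded so that other compilers simply run the loop on the host.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel used only to show where the clause goes.
// Computes y[i] += a * x[i]; x must hold at least y.size() elements.
void scale_add(float a, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();

// Assumption: ompx_attribute is only understood by recent clang with OpenMP
// enabled, so the pragma is skipped everywhere else.
#if defined(__clang__) && defined(_OPENMP) && (__clang_major__ >= 17)
#pragma omp target teams distribute parallel for \
    map(to: xp[0:n]) map(tofrom: yp[0:n]) \
    ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))
#endif
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }
}
```

The attribute arguments are the minimum and maximum flat workgroup size, so `(128, 256)` is what produces `.max_flat_workgroup_size: 256` in the second listing.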

The default of 1024 is clearly very bad in this case. When I write CUDA, even though 1024 is supported, I rarely use 1024; I mostly use 128 or 256.
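
A rough way to see why the workgroup size limits registers is sketched below. The hardware numbers are assumptions, not taken from this report: 4 SIMDs per compute unit and 512 VGPRs available per SIMD (which would match a gfx90a-class device). The function names are hypothetical, and the register allocation granularity is ignored.

```cpp
// Waves needed to hold one workgroup, rounded up.
constexpr unsigned waves_per_workgroup(unsigned workgroup_size, unsigned wavefront_size) {
    return (workgroup_size + wavefront_size - 1) / wavefront_size;
}

// Waves that must be resident on each SIMD so the whole workgroup fits on one
// compute unit, rounded up.
constexpr unsigned waves_per_simd(unsigned workgroup_size, unsigned wavefront_size,
                                  unsigned simds_per_cu) {
    return (waves_per_workgroup(workgroup_size, wavefront_size) + simds_per_cu - 1) /
           simds_per_cu;
}

// VGPRs each wave can use if the SIMD register file is split evenly among the
// resident waves. Assumption: the allocation granule is ignored.
constexpr unsigned vgpr_budget_per_wave(unsigned workgroup_size, unsigned wavefront_size,
                                        unsigned simds_per_cu, unsigned vgprs_per_simd) {
    return vgprs_per_simd / waves_per_simd(workgroup_size, wavefront_size, simds_per_cu);
}

// With the assumed hardware numbers, a 1024-thread workgroup leaves 128 VGPRs
// per wave, while a 256-thread workgroup leaves 512.
static_assert(vgpr_budget_per_wave(1024, 64, 4, 512) == 128, "1024 threads");
static_assert(vgpr_budget_per_wave(256, 64, 4, 512) == 512, "256 threads");
```

Under these assumptions the 1024 case is capped at the 128 VGPRs shown in the first listing, below the 166 the kernel uses when unconstrained, which is consistent with the 66 spills.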

1. Can `max_flat_workgroup_size` be chosen at link time, once the number of VGPRs the kernel needs is known?

2. When I specify a `thread_limit(192)` clause, can the compiler take advantage of it?

</pre>
