<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/64816">64816</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [OpenMP] amdgpu bad choice of max_flat_workgroup_size
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          ye-luo
      </td>
    </tr>
</table>

<pre>
    Currently clang always sets `max_flat_workgroup_size` to 1024, which causes register spilling in this kernel:
```
    .max_flat_workgroup_size: 1024
    .name: __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
    .private_segment_fixed_size: 264
    .sgpr_count:     60
    .sgpr_spill_count: 0
    .symbol: __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
    .vgpr_count:     128
    .vgpr_spill_count: 66
    .wavefront_size: 64
```

I tested overriding the default with `ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))`
and got a 2x kernel speed-up, with no VGPR spills:
```
    .max_flat_workgroup_size: 256
    .name: __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
    .private_segment_fixed_size: 0
    .sgpr_count:     58
    .sgpr_spill_count: 0
    .symbol: __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
    .vgpr_count:     166
    .vgpr_spill_count: 0
    .wavefront_size: 64
```
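
For reference, a minimal sketch of where the override clause goes. The function name `scale_in_place`, its arguments, and the loop body are hypothetical stand-ins and not the QMCPACK kernel; only the `ompx_attribute(...)` clause is the one quoted above. The preprocessor guard is an assumption on my side (that the clause is accepted by clang 18 and newer when OpenMP is enabled) so that other compilers just see a plain loop.

```cpp
// Assumption: ompx_attribute is a clang extension available from clang 18 on.
#if defined(_OPENMP)
#if defined(__clang__)
#if __clang_major__ >= 18
#define HAVE_OMPX_ATTRIBUTE 1
#endif
#endif
#endif

// Hypothetical kernel: multiplies every element of data[0..n) by factor.
void scale_in_place(float *data, int n, float factor) {
#ifdef HAVE_OMPX_ATTRIBUTE
  // Request a flat work-group size between 128 and 256 instead of the
  // default maximum of 1024 on amdgpu.
#pragma omp target teams distribute parallel for map(tofrom : data[0:n]) \
    ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))
#endif
  for (int i = 0; i < n; ++i)
    data[i] *= factor;
}
```

Without an offload target the loop runs on the host, so the numerical result is the same either way; the clause only affects the generated amdgpu kernel metadata.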

The default of 1024 is clearly a very bad choice in this case. When I write CUDA, even though 1024 is supported, I rarely use 1024; I mostly use 128 or 256.
1. Can `max_flat_workgroup_size` be chosen at link time, once the required VGPR count has been determined?
2. When I specify a `thread_limit(192)` clause, can the compiler take advantage of it?
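
To make question 2 concrete, here is a minimal sketch of a target region with the standard `thread_limit` clause. The function `saxpy` and its arguments are hypothetical; the value 192 is the one from the question. Whether clang derives a smaller `max_flat_workgroup_size` from this clause is exactly what is being asked, so the sketch only shows the source-level form.

```cpp
// Hypothetical kernel: y[i] = a * x[i] + y[i] for i in [0, n).
void saxpy(float a, const float *x, float *y, int n) {
  // thread_limit(192) caps the number of threads per team at 192.
#pragma omp target teams distribute parallel for thread_limit(192) \
    map(to : x[0:n]) map(tofrom : y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```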

</pre>