<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/64816>64816</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[OpenMP] amdgpu bad choice of max_flat_workgroup_size
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
ye-luo
</td>
</tr>
</table>
<pre>
Currently clang always sets `max_flat_workgroup_size` to 1024, which causes register spilling:
```
.max_flat_workgroup_size: 1024
.name: __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
.private_segment_fixed_size: 264
.sgpr_count: 60
.sgpr_spill_count: 0
.symbol: __omp_offloading_32_7a3077cd__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
.vgpr_count: 128
.vgpr_spill_count: 66
.wavefront_size: 64
```
I tested overriding the default using `ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))`
and got a 2x kernel speed-up with no spills:
```
.max_flat_workgroup_size: 256
.name: __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413
.private_segment_fixed_size: 0
.sgpr_count: 58
.sgpr_spill_count: 0
.symbol: __omp_offloading_32_7a4aee56__ZN11qmcplusplus17einspline_spo_ompIfE18multi_evaluate_vghERKSt6vectorIPNS_6SPOSetESaIS4_EERKS2_IPNS_11ParticleSetESaISA_EEib_l413.kd
.vgpr_count: 166
.vgpr_spill_count: 0
.wavefront_size: 64
```
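For reference, a minimal sketch of how the override above attaches to a target directive. The `scale` kernel is a hypothetical stand-in, not the QMCPACK kernel from the dumps, and the `__clang__` guard assumes a clang recent enough to accept the `ompx_attribute` clause; other compilers get the plain directive.

```cpp
// Hypothetical kernel, only to show where the clause goes.
void scale(float *data, int n, float factor) {
#if defined(__clang__)
// Assumption: this clang accepts the ompx_attribute extension clause.
#pragma omp target teams distribute parallel for map(tofrom: data[0:n]) \
    ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256))))
#else
#pragma omp target teams distribute parallel for map(tofrom: data[0:n])
#endif
  for (int i = 0; i < n; ++i)
    data[i] *= factor;
}
```

Built without OpenMP the pragmas are ignored and the loop runs serially on the host, so the result is the same either way; only the generated kernel metadata differs.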
The default of 1024 is clearly very bad in this case. When I write CUDA, even though 1024 is supported, I rarely use 1024; I mostly use 128 or 256.
1. Can `max_flat_workgroup_size` be chosen at link time, once the required VGPR count is known?
2. When I specify a `thread_limit(192)` clause, can the compiler take advantage of it?
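
To illustrate question 1, here is a rough sketch of the arithmetic such a choice could use. The hardware figures are assumptions, not something stated in the dumps above: a compute unit with 4 SIMDs, 64-wide wavefronts, and 512 VGPRs per lane on each SIMD. All names are hypothetical and this is not how clang currently decides anything.

```cpp
// Assumed hardware figures (not taken from the report).
const int kSimdsPerCu = 4;
const int kWavefrontSize = 64;
const int kVgprsPerSimdLane = 512;

// VGPRs each thread may use if a whole workgroup of wg_size threads
// has to be resident on one compute unit at the same time.
int vgpr_budget(int wg_size) {
  int waves = (wg_size + kWavefrontSize - 1) / kWavefrontSize;
  int waves_per_simd = (waves + kSimdsPerCu - 1) / kSimdsPerCu;
  return kVgprsPerSimdLane / waves_per_simd;
}

// Largest power-of-two workgroup size in [64, 1024] whose budget
// covers needed_vgprs; returns 0 if no size fits.
int pick_max_workgroup_size(int needed_vgprs) {
  for (int wg = 1024; wg >= 64; wg /= 2)
    if (vgpr_budget(wg) >= needed_vgprs)
      return wg;
  return 0;
}
```

Under these assumptions `vgpr_budget(1024)` is 128, which would be consistent with the `.vgpr_count: 128` plus spills in the first dump, and `pick_max_workgroup_size(166)` gives 512 for the 166 VGPRs the kernel used once it was unconstrained.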
</pre>