<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/131415>131415</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[cuclang 20.1.0] Using __reduce_max_sync in a cuda kernel fails with an illegal instruction was encountered
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
AustinSchuh
</td>
</tr>
</table>
<pre>
Using clang 20.1.0. I've got a CUDA kernel which boils down to the following:
```
__global__ void testcode(const float* data, unsigned *max_value) {
unsigned r = static_cast<unsigned>(data[threadIdx.x]);
const unsigned mask = __ballot_sync(0xFFFFFFFF, true);
unsigned mx = __reduce_max_sync(mask, r);
atomicMax(max_value, mx);
}
```
When I run this with nvcc, it works, and with clang, I get an `an illegal instruction was encountered` message.
Compiler explorer confirms that they generate slightly different ptx.
nvcc
```
ld.param.u64 %rd1, [testcode(float const*, unsigned int*)_param_0];
ld.param.u64 %rd2, [testcode(float const*, unsigned int*)_param_1];
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mov.u32 %r1, %tid.x;
mul.wide.u32 %rd5, %r1, 4;
add.s64 %rd6, %rd4, %rd5;
ld.global.f32 %f1, [%rd6];
cvt.rzi.u32.f32 %r2, %f1;
mov.pred %p1, -1;
mov.u32 %r3, -1;
vote.sync.ballot.b32 %r4, %p1, %r3;
redux.sync.max.u32 %r5, %r2, %r4;
atom.global.max.u32 %r6, [%rd3], %r5;
ret;
```
clang
```
ld.param.u64 %rd1, [testcode(float const*, unsigned int*)_param_0];
ld.param.u64 %rd2, [testcode(float const*, unsigned int*)_param_1];
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mov.u32 %r1, %tid.x;
mul.wide.u32 %rd5, %r1, 4;
add.s64 %rd6, %rd4, %rd5;
ld.global.f32 %f1, [%rd6];
cvt.rzi.u32.f32 %r2, %f1;
mov.pred %p1, -1;
vote.sync.ballot.b32 %r3, %p1, -1;
redux.sync.max.u32 %r4, %r3, %r2;
atom.global.max.u32 %r5, [%rd3], %r4;
ret;
```
I hacked around and got it to work with:
```
__global__ void testcode2(const float* data, unsigned int* max_value) {
unsigned int r = static_cast<unsigned int>(data[threadIdx.x]);
unsigned int mask = __ballot_sync(0xFFFFFFFF, true);
unsigned int mx;
asm volatile(
" redux.sync.max.u32 %0, %2, %1;"
: "=r"(mx) // Output operand (mask)
: "r"(mask), "r"(r)// No input operands
: "cc" // Clobbered registers and condition codes
);
atomicMax(max_value, mx);
}
```
The fix being to swap the mask and input on the redux.sync.
</pre>
<img width="1" height="1" alt="" src="http://email.email.llvm.org/o/eJzMV12P6rYW_TXmZetEiZPA8MADA3ekebj3PrRHfUSObRJ3HBvZO8D011d2PoChc0pPW7UREjPJ3mt_rrDMvFe1kXJFymdSbmesw8a61brzqMwPvOmaWWXF--qrV6YGrpmpgaZJlqQJwCuhi6OE2iIw2HzdruFNOiM1nBrFG6is0h6EPRlAC9hI2Fut7UmZmuRrksbPPB0-6Xq3q7WtmN7t4GiVAJQeuRWS0CdujUfYa8uQ0DUIhozQDXQmZi-A0HXLzrsj050kdAlk8UzSNQBcTByQfAseGSq-48wjyTfjQ5L_h9CniFo-Y-MkE6_inJxJuSV0SfLnPtmA12cyobbMv0Xg3a5iWlvc-XfDCX1Kzy_DFRJFF_O6QbpgnAcEJ0XH5S5UMqAE-ODvJufgydC2iv-XnaPFVPYG2vMlyGL7ob0kXf_USAOv4DoD2CgPJ4UNmCPnwVkhnKx78-FvZkT_ME483HmFWiIwA2SeMgNKa1kzDcp4dB1HZQ2cmAdpuO0MSicFmafQSu9ZLZM-_Ma2B6WlA3k-aOukC93cK9d6wIZhWJF3qKWRjqEEr1XdoH4HofZ76aRBOOB5gIpJ39YHw6VFcmCOtUk3L8L_hJZOZKGGMNzLTsVt6udJ6PpmnZTpby13EWmXhkWY2v9pEPqngmR3QfgRWYI26WkxhhqC5TFYH_Zxr2Lyym69WntMupzGZ9lghEok5w9mnU5OSshoO4KWI2h0LG49mBCJnxe95XwKf0mkvOvskPk-hiC03I_TGzCmPvEjJu4XFZIZrC-l0iHA_jcKPTgprmwPEf_LaDi24hotv7EYrqNFmQSmJj33k-rSlLG-w9hMlw_OgeXn3q1l56npUxPHxN3HRqJtx9aMjmO0-XWD8vja2gyoNxhO4vB6uH0x9DT_t_DpryXTP8Gi69X5W9h0HeCPs-pbnLrU__3UuufUA4zJbxgzeX5Cl-JCq4k3D9Ol_IwuxWN0eYWG8TcpgDnbGRF_LoMKUhiUTvgVjb-eg8p5QOLQBzROv-LwezpHGfym1ok4D-idO9DvkTr3IIEB8PgVMJhv4Wg1Q6XDW-ABj2lX6Sfrkw4DHzc5Lhul1775OviTfOvCF32K2goIfSH0Bf7f4aFDsAfpwuwnpba8S6GHGTF6oz7oeDNqux72fxaUuQL2t2UNYEGu0TGTjbZVFeQWOFkrj9L5uI7cGqGiLAv75YeU-sFcEnxUUEKvKCenD3z4MWh7dYZKhlMCWvAndoiKPy5NyGeoy8S7V0OZiVUulvmSzeQqWxQ0TXNaLGbNar7fMzl_EsV8seSLgu8zVjKe5ZJnVVkxPlMrmtIyzbMyzYoFXSRlyRldVrQQfM7TdE6KVLZM6UTrY5tYV8-U951cZXlWZOVMs0pqHw8-lBp5gvg0zKTcztwqOH2putqTItXKo7_AoEIdT0y8uz4RkXIL_TnpTsqDMsCAd4KNR6Q9C4ejqLEfk9OzzulVg3jw4bUSh18rbLoq4bYl9CVkN3x9OTj7s-RI6EusyRP6MhR9XNFfAwAA__8L_u2I">