<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/54226">54226</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Missed optimization in arm64 neon zip operation
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
uncleasm
</td>
</tr>
</table>
<pre>
As shown in https://godbolt.org/z/xoz5sbTes
the generated code is suboptimal: the task should be achievable with a single instruction,
`zip1 v0.16b, v0.16b, v1.16b`
The problem, of course, is that the top half of the 16-byte register that aliases the 8-byte register is undefined. However, the instruction never reads those bytes. The methods below try to work around the issue by convincing the backend to produce the optimal code, but as can be seen, they fail. One possibility would be to introduce a new intrinsic for this purpose.
```
#include "arm_neon.h"
using u16 = uint8_t __attribute__((vector_size(16)));
using u8 = uint8_t __attribute__((vector_size(8)));

// Attempt 1: vzip_u8 followed by vcombine_u8.
uint8x16_t zip(uint8x8_t a, uint8x8_t b) {
    auto A = vzip_u8(a, b);
    return vcombine_u8(A.val[0], A.val[1]);
}
// Generated code:
//   zip1 v2.8b, v0.8b, v1.8b
//   zip2 v0.8b, v0.8b, v1.8b
//   mov v2.d[1], v0.d[0]
//   mov v0.16b, v2.16b

// Attempt 2: widen both inputs with zeros, then vzip1q_u8.
uint8x16_t zip2(uint8x8_t a, uint8x8_t b) {
    auto A = vcombine_u8(a, vdup_n_u8(0));
    auto B = vcombine_u8(b, vdup_n_u8(0));
    return vzip1q_u8(A, B);
}
// Generated code:
//   zip2 v2.8b, v0.8b, v1.8b
//   zip1 v0.8b, v0.8b, v1.8b
//   mov v0.d[1], v2.d[0]

// Attempt 3: element-wise construction with generic vector types.
u16 zip3(u8 a, u8 b) {
    return u16{
        a[0], b[0],
        a[1], b[1],
        a[2], b[2],
        a[3], b[3],
        a[4], b[4],
        a[5], b[5],
        a[6], b[6],
        a[7], b[7]
    };
}
// Generated code:
//   zip2 v2.8b, v0.8b, v1.8b
//   zip1 v0.8b, v0.8b, v1.8b
//   mov v0.d[1], v2.d[0]
```
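
To illustrate why the undefined top half should not matter, here is a rough scalar model of what `zip1` with the `.16b` arrangement computes. It is only a sketch in portable C++, not backend code: `Vec16`, `zip1_16b_model`, `widen` and `upper_half_is_irrelevant` are hypothetical names made up for this illustration. The model reads only bytes 0..7 of each source, so whatever sits in bytes 8..15 cannot affect the result.

```cpp
#include "cstdint"
#include "cstring"

// Plain 16-byte value standing in for a 128-bit NEON register (model only).
struct Vec16 {
    std::uint8_t b[16];
};

// Scalar model of ZIP1 with the .16b arrangement: interleave the low eight
// bytes of each source. Bytes 8..15 of the sources are never read.
Vec16 zip1_16b_model(Vec16 n, Vec16 m) {
    Vec16 d{};
    for (int i = 0; i != 8; ++i) {
        d.b[2 * i]     = n.b[i];
        d.b[2 * i + 1] = m.b[i];
    }
    return d;
}

// Hypothetical helper: put 8 bytes in the low half and `fill` in the high
// half, mimicking an 8-byte value viewed through its 16-byte register.
Vec16 widen(const std::uint8_t* lo, std::uint8_t fill) {
    Vec16 v;
    std::memset(v.b, fill, sizeof v.b);
    std::memcpy(v.b, lo, 8);
    return v;
}

// Returns true if the result is the same whether the high halves hold
// zeros or arbitrary values.
bool upper_half_is_irrelevant() {
    const std::uint8_t a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const std::uint8_t b[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    const Vec16 zeroed  = zip1_16b_model(widen(a, 0x00), widen(b, 0x00));
    const Vec16 garbage = zip1_16b_model(widen(a, 0xAA), widen(b, 0x55));
    return std::memcmp(zeroed.b, garbage.b, sizeof zeroed.b) == 0;
}
```

In this model the zero fill used in the `vcombine_u8(a, vdup_n_u8(0))` attempt is just one of many equally valid choices for the high half, which is what a dedicated intrinsic would let the backend exploit.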
</pre>