<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/118413">118413</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[AArch64] Suboptimal abs-diff codegen
</td>
</tr>
<tr>
<th>Labels</th>
<td>
backend:AArch64,
missed-optimization
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Kmeakin
</td>
</tr>
</table>
<pre>
https://godbolt.org/z/Wh9sE4754
https://alive2.llvm.org/ce/z/N6uVzT
In the 32-bit and 64-bit cases, `tgt` is obviously better than `src`, since it is one instruction shorter:
```asm
src_u32:
sub w8, w1, w0
subs w9, w0, w1
csel w0, w9, w8, hi
ret
tgt_u32:
subs w8, w0, w1
cneg w0, w8, lo
ret
```
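For reference, here is a minimal C sketch of what the two versions compute, inferred from the asm above (the actual reproducer is behind the godbolt/Alive2 links, so these bodies are a reconstruction, not the original source):
```c
#include <stdint.h>

// src form: compute both differences and select the non-negative one
// (maps to the sub + subs + csel sequence above).
uint32_t src_u32(uint32_t a, uint32_t b) {
    return a > b ? a - b : b - a;
}

// tgt form: compute one difference and conditionally negate it
// (maps to subs + cneg), one instruction shorter.
uint32_t tgt_u32(uint32_t a, uint32_t b) {
    uint32_t d = a - b; // unsigned wraparound: -(a - b) == b - a (mod 2^32)
    return a < b ? -d : d;
}
```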
In the 8-bit and 16-bit cases, `src` and `tgt` have the same number of instructions, but `tgt` exposes more ILP: in `src_u8` all four instructions form a single dependency chain, while in `tgt_u8` the `sub` is independent of the `and`/`cmp` chain, shortening the critical path from 4 instructions to 3:
```asm
src_u8:
and w8, w0, #0xff
sub w8, w8, w1, uxtb
cmp w8, #0
cneg w0, w8, mi
ret
tgt_u8:
and w8, w0, #0xff
sub w9, w0, w1
cmp w8, w1, uxtb
cneg w0, w9, ls
ret
```
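The same sketch at 8-bit width (again a reconstruction from the asm); the extra `and`/`uxtb` come from zero-extending the narrow arguments, whose upper register bits are unspecified under AAPCS64:
```c
#include <stdint.h>

// 8-bit variant of the same pattern; the and/uxtb instructions in the asm
// re-narrow the i8 arguments before the compare/subtract.
uint8_t src_u8(uint8_t a, uint8_t b) {
    return a > b ? a - b : b - a;
}

uint8_t tgt_u8(uint8_t a, uint8_t b) {
    uint8_t d = (uint8_t)(a - b); // wraps mod 2^8
    return a < b ? (uint8_t)-d : d;
}
```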
I suspect the code generated for `tgt` in the 128-bit cases is not optimal either.
</pre>