<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/54644>54644</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
loop-unrolling anti-optimisation at -O2
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
andyhhp
</td>
</tr>
</table>
<pre>
Sorry for the generic title, but I can't think of a better categorisation (other than perhaps "code generation so bad I'd like my money back" :smiley:).
Full example: https://godbolt.org/z/MKc7Meh5x
This is a real piece of logic for auditing guest updates to a register (x86's MSR_PAT specifically), which is a slowpath in traditional virtualisation, but a fairly fastpath for nested virtualisation. Given:
```c
#include <stdint.h>
int check_pat(uint64_t val)
{
unsigned int i;
for ( i = 0; i < 8; i++, val >>= 8 )
{
switch ( val & 0xff )
{
case 0:
case 1:
case 4:
case 5:
case 6:
case 7:
continue;
default:
return 0;
}
}
return 1;
}
```
the code generation at `-O1` looks pretty good. GCC manages slightly better (by dropping another conditional jump in the loop body) but it's a simple loop. (More on this later).
```
check_pat: # @check_pat
movl $8, %eax
jmp .LBB0_1
.LBB0_4: # in Loop: Header=BB0_1 Depth=1
shrq $8, %rdi
decl %eax
je .LBB0_5
.LBB0_1: # =>This Inner Loop Header: Depth=1
leal -4(%rdi), %ecx
cmpb $4, %cl
jb .LBB0_4
cmpb $1, %dil
jbe .LBB0_4
xorl %eax, %eax
retq
.LBB0_5:
movl $1, %eax
retq
```
However, the code generation at `-O2` is outrageous:
```
check_pat: # @check_pat
xorl %eax, %eax
cmpb $7, %dil
ja .LBB0_16
movzbl %dil, %ecx
movl $243, %edx
btl %ecx, %edx
jae .LBB0_16
movq %rdi, %rcx
shrq $8, %rcx
cmpb $7, %cl
ja .LBB0_16
movzbl %cl, %ecx
movl $243, %edx
btl %ecx, %edx
jae .LBB0_16
# <snip lots of repeats>
movq %rdi, %rcx
shrq $59, %rcx
jne .LBB0_16
sarq $56, %rdi
movl .Lswitch.table.check_pat(,%rdi,4), %eax
.LBB0_16:
retq
.Lswitch.table.check_pat:
.long 1 # 0x1
.long 1 # 0x1
.long 0 # 0x0
.long 0 # 0x0
.long 1 # 0x1
.long 1 # 0x1
.long 1 # 0x1
.long 1 # 0x1
```
which has a number of issues
1. The loop is unrolled (again, more on this later)
2. Each iteration of the loop (and therefore duplicated 8 times), the same constant is reloaded into `%edx` despite the register not being clobbered
3. The return value for the function is a 0 from `xor %eax` in the first instruction, or picked up as a 1 from the `.Lswitch.table.check_pat:`. I don't even know what to call this transformation, but it would be *far* better replaced with `bt` as per earlier iterations, and a single `setc %al` to drop the memory load and 32 byte(!) table.
This loop should not be unrolled at any optimisation level. It's a fixed number of iterations with a simple induction variable, so can be predicted perfectly on even ~10yo hardware. The loop carry dependency is trivial, as it's data shifted out of `val` one byte at a time, and there's no latency-sensitive work which can be shuffled earlier. Furthermore, decode bandwidth which would be decoding beyond the loop is wasted re-decoding the same uops which could be served from the uop cache.
Genuinely, the `-O1` code generation is far preferable to anything that higher optimisation levels spit out, in terms of binary size, runtime speed, and power utilisation. (This example is too small, but longer loops which fit in the uop cache will allow power savings.)
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzFWFtz2zYT_TXyC8YakaJk-0EPviRtpsn0m695z4AkKMKGABYAdcmv71mApClZcjOdXjiyLJLYxdmzu8AuclMeVr8Zaw-sMpb5WrC10MLKgnnplZikjyxvPfvECq4n6Y3HEKlfmKkYZ7nwXli88WJtrHTcS6PZJL010EPKuGaNsDVvHJ6mhSk77XGgMyznJfsEtSVT8kWwzYFtjBYHPC9eIMEm83u3kUoc8GOS3k0ns6fJ7D5-f2yVYmLPNw1gzu9Z7X3jwriP-KxNmRvlp8aucfcdf19-KW6-iHqxHyv5WkvH8OHMCq5YI0UhyDhl1qCAKOFtKb3Ua7ZuhfOsbUqY65g3QWYtHVEAm_e3Sxji2Jff_v_tf_dfmWtEIStZcKUOgE5E7mpZ1HE2p8yu4R53mnnLaQqjAWArrW-56rjs2ees4tIq-Ig7H8QImQYeUZ6ITBn7SW6FJiKinctZ_BTdfTqXulAtfDGZPzpfSu2n9WT-YUwLnrGiFsXLN8wG41o8WGbfPNtyRcbEoTcP8QfD1Won1xpwSFRO5g9jdTSAEEMTk5j2ic0wIvx8ZLfh5yR9CJ9HmoIRHPo8sVs2zEdajuaky-2kB6mkOQimSzbbV9Wx1FnJgjtBOO7PPE7OP87OP16cf7w8__jmzePwymhEWSvOMEdXKSreKn9W1ArfWh0oPTX5aczc06niTjB5nXMY0wfNWIQWh9Mc5p5h1PWvCb6RNObFsQZq_YGtjSkpGB8f2YZrvkbOOCXXtUcYdwsHnJYfWGlN01CCcR3XDVAx5MNzu2lCjmBuqG9YjgULvg1pIX3IOGSTpEUgDMCUUPvFWKQxiSHbFBLWjlaPY9tew3x-xOrbC5nDJtnsVeCI7I3Zqjgqu6UwnqQLwffHY55hDF3Tzw8Ps29JfBlvsj-dPgJgxMZnGErjfxa8hGnzp6COPYnGI5OfkpMUqe3vJ8hsKU_jq-jQn0Et2Aj1Yow6-UHUAIV8DovtJ43gCRYM8O8vIVe0JuO6zuDTDnZcSglncYKz2DR5Z2fWjSnUiSn52JTsonzSyZfyjQJxWcHe2DGJF4IA6fH7mMO3q8colJI_13I2V382O7GlqH9k76VtSmkLp5gWm9BamNa93Tf-3jT5MY5Grri56Ao-9mWyfMPh91yxXvJCyIyYTrN5P6o8GZV7xXrMxf7SqGcu3sfTJWGM4piGp3jOpeo7YX5zKcx_lJriP2RmKEdCHaJlg_XbOyq-rGgE9-6kJvnLTC7uLg161u-5zPFXFctLC2fP0_RzrESmnudKTMfFE0QHrNlo_eqDfpj9dB0YrxQXtJ-KTJXBTspYwt6_iPbZPvkHhGc_JDz7B4T_O5v_PeGTVTl2FDWnIki3mxw7K_JHOodeJY5IpuxrXzthoW-1NUqhTEdc8jWXocXYnKuWong6ZR84NS2-3zqgfyjGSIku6d6KipSUbaMk9YMlCncvN0AR451EHN_QTqSd59QjOMS3MqgAQs9gaEPqFg_sSaVwDeYMckOXhfoQpSOVioUyOYwVZUQ5j0Z29Sz6gFYM7WzV6iIAD33XjFXWbGiufehHQhbSHhhLzEpa9HgSEG1b9B0YBjayeAHOtmGB6iRqIQkIv5eceMvQPJcmNs_YkzV70WaHXhB7MKym_jASjx1YO6DeHLV-0rOdaVUJwwH3vuLwzX1fQGOhVLwAMMxfE5TckzHAiLabCW6VxP_Bd46UksOoYtZrFdA74QsiAo0dRIGI6vFg2kYgMA6MfBSk5inLD16EFS2hGjza-6adDqHh6oA6uuw17GA01wdmGgRHf2CgwIoimvpyvpJ7DB3F82BANHQo-KUuo5vgcysJDVnoiFVN06IVKWVB0Qg-KlFQ74HBwQuTmw_J7GCQPLbccSsAYEiUgtOBSIltSJdCFwcW3CO3kof9Evx2rUfJPcDUsqI5UEQRWtC4jWQabC_EWLA6pEPvgJAxQYM2IeMwybUT6KE9mnd43L50pwWdJa5uq4oI7JwKtB9bS2ooeUktKngq8nKo38kSLEX5IXjCe8qdXBxMhDCsCjseThKsuB5GDQnbmsb1WHpdTtgtxg9J0AbOEPhHwfCT0K3Ugo4_Hvtc6TrF03oUGBDZ5K8Kj-DHcL6iD3TYRFhAYI3ekcLhTeSgr8RSQezTPJTIICXUEbnUHH508ntgyLaafEAnM1g4Ok80KJQta70cHaAgwkMgd4dLwfvGMLdBrvZ5SYs3BInBnp5K-n4dGQhBwCK_IYeUj1M5voVNbooEuipX8_JufsevwlnbipRdx1SJ_bCX10f2ggbU7FetVauT0y6kRZtPC7PBjVLb_t91Y80z4h633aaQflxkyyy7qld3WVYt78q7slgkvCzyNKluk2VR3t5lVS4Wt1eK52B3NVk8TNJUi13cV_B7sni6kqt0lqaz-XyWZPMsm01v04KLBV-IeTnPlnyBFkBsuFRTwkHHcFd2FSDl7drhJfj27vUld-H8SITpoJ-3vjZ2BRcd6rq5ClOvAvQ_ABjO1Xo">