<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/54644>54644</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            loop-unrolling anti-optimisation at -O2
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          andyhhp
      </td>
    </tr>
</table>

<pre>
    Sorry for the generic title, but I can't think of a better categorisation (other than perhaps "code generation so bad I'd like my money back" :smiley:).

Full example: https://godbolt.org/z/MKc7Meh5x

This is a real piece of logic for auditing guest updates to a register (x86's MSR_PAT specifically), which is a slowpath in traditional virtualisation, but a fairly fastpath for nested virtualisation.  Given:
```c
#include <stdint.h>

int check_pat(uint64_t val)
{
    unsigned int i;

    for ( i = 0; i < 8; i++, val >>= 8 )
    {
        switch ( val & 0xff )
        {
        case 0:
        case 1:
        case 4:
        case 5:
        case 6:
        case 7:
            continue;

        default:
            return 0;
        }
    }

    return 1;
}
```

the code generation at `-O1` looks pretty good.  GCC manages slightly better (by dropping another conditional jump in the loop body) but it's a simple loop.  (More on this later).
```
check_pat:                              # @check_pat
        movl    $8, %eax
        jmp     .LBB0_1
.LBB0_4:                                #   in Loop: Header=BB0_1 Depth=1
        shrq    $8, %rdi
        decl    %eax
        je      .LBB0_5
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        leal    -4(%rdi), %ecx
        cmpb    $4, %cl
        jb      .LBB0_4
        cmpb    $1, %dil
        jbe     .LBB0_4
        xorl    %eax, %eax
        retq
.LBB0_5:
        movl    $1, %eax
        retq
```

However, the code generation at `-O2` is outrageous:
```
check_pat:                              # @check_pat
        xorl    %eax, %eax
        cmpb    $7, %dil
        ja      .LBB0_16
        movzbl  %dil, %ecx
        movl    $243, %edx
        btl     %ecx, %edx
        jae     .LBB0_16
        movq    %rdi, %rcx
        shrq    $8, %rcx
        cmpb    $7, %cl
        ja      .LBB0_16
        movzbl  %cl, %ecx
        movl    $243, %edx
        btl     %ecx, %edx
        jae     .LBB0_16

# <snip lots of repeats>

        movq    %rdi, %rcx
        shrq    $59, %rcx
        jne     .LBB0_16
        sarq    $56, %rdi
        movl    .Lswitch.table.check_pat(,%rdi,4), %eax
.LBB0_16:
        retq
.Lswitch.table.check_pat:
        .long   1                               # 0x1
        .long   1                               # 0x1
        .long   0                               # 0x0
        .long   0                               # 0x0
        .long   1                               # 0x1
        .long   1                               # 0x1
        .long   1                               # 0x1
        .long   1                               # 0x1
```
which has a number of issues:
1. The loop is unrolled (again, more on this later)
2. In each iteration of the loop (and therefore duplicated 8 times), the same constant is reloaded into `%edx` despite the register not being clobbered
3. The return value for the function is either the 0 from `xorl %eax, %eax` in the first instruction, or loaded from the `.Lswitch.table.check_pat` table for the final byte.  I don't even know what to call this transformation, but it would be *far* better replaced with `bt` as per earlier iterations, and a single `setc %al` to drop the memory load and 32 byte(!) table.
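
As a rough sketch of the `bt`-style check suggested in point 3, here is the same logic written against the constant 243 (0xf3) that the `-O2` output already uses for the earlier bytes. The names `check_pat_bitmask` and `PAT_VALID_TYPES` are illustrative only and are not part of the original code; this shows the intended shape of the test, not what any compiler is guaranteed to emit:
```c
#include <stdint.h>

/* Hypothetical name: bits 0, 1, 4, 5, 6 and 7 set, i.e. 243, one bit per
 * value accepted by the switch statement in check_pat(). */
#define PAT_VALID_TYPES 0xf3u

int check_pat_bitmask(uint64_t val)
{
    unsigned int i;

    for ( i = 0; i < 8; i++, val >>= 8 )
    {
        unsigned int type = val & 0xff;

        /* Range check first, so the shift below never exceeds 7 bits. */
        if ( type > 7 || !((PAT_VALID_TYPES >> type) & 1) )
            return 0;
    }

    return 1;
}
```
For example, `check_pat_bitmask(0x0007040600070406)` (which I believe is the reset value of MSR_PAT) returns 1, while any value with a byte of 2, 3, or anything above 7 returns 0.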

This loop should not be unrolled at any optimisation level.  It has a fixed number of iterations with a simple induction variable, so the loop branch can be predicted perfectly on even ~10yo hardware.  The loop-carried dependency is trivial, as it's data shifted out of `val` one byte at a time, and there's no latency-sensitive work which can be shuffled earlier.  Furthermore, unrolling wastes decode bandwidth: instead of decoding code beyond the loop, the frontend re-decodes eight copies of the same instructions, which a rolled loop could serve from the uop cache.
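
For anyone hitting the same thing in the meantime, a possible source-level workaround (not a fix for the underlying heuristic) is Clang's loop hint pragma. The sketch below is the original function with the hint added; the name `check_pat_nounroll` is made up for illustration, and I have not confirmed that the hint brings back the `-O1` code shape for this particular loop:
```c
#include <stdint.h>

int check_pat_nounroll(uint64_t val)
{
    unsigned int i;

    /* Clang-specific hint asking the optimiser not to unroll the loop that
     * follows.  Guarded so other compilers don't see an unknown pragma. */
#ifdef __clang__
#pragma clang loop unroll(disable)
#endif
    for ( i = 0; i < 8; i++, val >>= 8 )
    {
        switch ( val & 0xff )
        {
        case 0:
        case 1:
        case 4:
        case 5:
        case 6:
        case 7:
            continue;

        default:
            return 0;
        }
    }

    return 1;
}
```
The pragma only affects the optimiser's unrolling decision; the function's behaviour is identical to `check_pat()` above.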

Genuinely, the `-O1` code generation is far preferable to anything that higher optimisation levels spit out, in terms of binary size, runtime speed, and power utilisation.  (This example is too small, but longer loops which fit in the uop cache will allow power savings.)
</pre>