<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/54644>54644</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            loop-unrolling anti-optimisation at -O2

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          andyhhp

      </td>

    </tr>

</table>

<pre>

    Sorry for the generic title, but I can't think of a better categorisation (other than perhaps "code generation so bad I'd like my money back" :smiley:).

Full example: https://godbolt.org/z/MKc7Meh5x

This is a real piece of logic for auditing guest updates to a register (x86's MSR_PAT specifically), which is a slowpath in traditional virtualisation, but a fairly fastpath for nested virtualisation.  Given:

```c

#include <stdint.h>

int check_pat(uint64_t val)

{

    unsigned int i;

    for ( i = 0; i < 8; i++, val >>= 8 )

    {

        switch ( val & 0xff )

        {

        case 0:

        case 1:

        case 4:

        case 5:

        case 6:

        case 7:

            continue;

        default:

            return 0;

        }

    }

    return 1;

}

```

the code generation at `-O1` looks pretty good.  GCC manages slightly better (by dropping another conditional jump in the loop body) but it's a simple loop.  (More on this later).

```

check_pat:                              # @check_pat

        movl    $8, %eax

        jmp     .LBB0_1

.LBB0_4:                                #   in Loop: Header=BB0_1 Depth=1

        shrq    $8, %rdi

        decl    %eax

        je      .LBB0_5

.LBB0_1:                                # =>This Inner Loop Header: Depth=1

        leal    -4(%rdi), %ecx

        cmpb    $4, %cl

        jb      .LBB0_4

        cmpb    $1, %dil

        jbe     .LBB0_4

        xorl    %eax, %eax

        retq

.LBB0_5:

        movl    $1, %eax

        retq

```

However, the code generation at `-O2` is outrageous:

```

check_pat:                              # @check_pat

        xorl    %eax, %eax

        cmpb    $7, %dil

        ja      .LBB0_16

        movzbl  %dil, %ecx

        movl    $243, %edx

        btl     %ecx, %edx

        jae     .LBB0_16

        movq    %rdi, %rcx

        shrq    $8, %rcx

        cmpb    $7, %cl

        ja      .LBB0_16

        movzbl  %cl, %ecx

        movl    $243, %edx

        btl     %ecx, %edx

        jae     .LBB0_16

# <snip lots of repeats>

        movq    %rdi, %rcx

        shrq    $59, %rcx

        jne     .LBB0_16

        sarq    $56, %rdi

        movl    .Lswitch.table.check_pat(,%rdi,4), %eax

.LBB0_16:

        retq

.Lswitch.table.check_pat:

        .long   1                               # 0x1

        .long   1                               # 0x1

        .long   0                               # 0x0

        .long   0                               # 0x0

        .long   1                               # 0x1

        .long   1                               # 0x1

        .long   1                               # 0x1

        .long   1                               # 0x1

```

which has a number of issues

1. The loop is unrolled (again, more on this later)

2. Each iteration of the loop (and therefore duplicated 8 times), the same constant is reloaded into `%edx` despite the register not being clobbered

3. The return value for the function is a 0 from `xor %eax` in the first instruction, or picked up as a 1 from the `.Lswitch.table.check_pat:`.  I don't even know what to call this transformation, but it would be *far* better replaced with `bt` as per earlier iterations, and a single `setc %al` to drop the memory load and 32 byte(!) table.

This loop should not be unrolled at any optimisation level.  It's a fixed number of iterations with a simple induction variable, so can be predicted perfectly on even ~10yo hardware.  The loop carry dependency is trivial, as it's data shifted out of `val` one byte at a time, and there's no latency-sensitive work which can be shuffled earlier.  Furthermore, decode bandwidth which would be decoding beyond the loop is wasted re-decoding the same uops which could be served from the uop cache.

Genuinely, the `-O1` code generation is far preferable to anything that higher optimisation levels spit out, in terms of binary size, runtime speed, and power utilisation.  (This example is too small, but longer loops which fit in the uop cache will allow power savings.)

</pre>

<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzFWFtz2zYT_TXyC8YakaJk-0EPviRtpsn0m695z4AkKMKGABYAdcmv71mApClZcjOdXjiyLJLYxdmzu8AuclMeVr8Zaw-sMpb5WrC10MLKgnnplZikjyxvPfvECq4n6Y3HEKlfmKkYZ7nwXli88WJtrHTcS6PZJL010EPKuGaNsDVvHJ6mhSk77XGgMyznJfsEtSVT8kWwzYFtjBYHPC9eIMEm83u3kUoc8GOS3k0ns6fJ7D5-f2yVYmLPNw1gzu9Z7X3jwriP-KxNmRvlp8aucfcdf19-KW6-iHqxHyv5WkvH8OHMCq5YI0UhyDhl1qCAKOFtKb3Ua7ZuhfOsbUqY65g3QWYtHVEAm_e3Sxji2Jff_v_tf_dfmWtEIStZcKUOgE5E7mpZ1HE2p8yu4R53mnnLaQqjAWArrW-56rjs2ees4tIq-Ig7H8QImQYeUZ6ITBn7SW6FJiKinctZ_BTdfTqXulAtfDGZPzpfSu2n9WT-YUwLnrGiFsXLN8wG41o8WGbfPNtyRcbEoTcP8QfD1Won1xpwSFRO5g9jdTSAEEMTk5j2ic0wIvx8ZLfh5yR9CJ9HmoIRHPo8sVs2zEdajuaky-2kB6mkOQimSzbbV9Wx1FnJgjtBOO7PPE7OP87OP16cf7w8__jmzePwymhEWSvOMEdXKSreKn9W1ArfWh0oPTX5aczc06niTjB5nXMY0wfNWIQWh9Mc5p5h1PWvCb6RNObFsQZq_YGtjSkpGB8f2YZrvkbOOCXXtUcYdwsHnJYfWGlN01CCcR3XDVAx5MNzu2lCjmBuqG9YjgULvg1pIX3IOGSTpEUgDMCUUPvFWKQxiSHbFBLWjlaPY9tew3x-xOrbC5nDJtnsVeCI7I3Zqjgqu6UwnqQLwffHY55hDF3Tzw8Ps29JfBlvsj-dPgJgxMZnGErjfxa8hGnzp6COPYnGI5OfkpMUqe3vJ8hsKU_jq-jQn0Et2Aj1Yow6-UHUAIV8DovtJ43gCRYM8O8vIVe0JuO6zuDTDnZcSglncYKz2DR5Z2fWjSnUiSn52JTsonzSyZfyjQJxWcHe2DGJF4IA6fH7mMO3q8colJI_13I2V382O7GlqH9k76VtSmkLp5gWm9BamNa93Tf-3jT5MY5Grri56Ao-9mWyfMPh91yxXvJCyIyYTrN5P6o8GZV7xXrMxf7SqGcu3sfTJWGM4piGp3jOpeo7YX5zKcx_lJriP2RmKEdCHaJlg_XbOyq-rGgE9-6kJvnLTC7uLg161u-5zPFXFctLC2fP0_RzrESmnudKTMfFE0QHrNlo_eqDfpj9dB0YrxQXtJ-KTJXBTspYwt6_iPbZPvkHhGc_JDz7B4T_O5v_PeGTVTl2FDWnIki3mxw7K_JHOodeJY5IpuxrXzthoW-1NUqhTEdc8jWXocXYnKuWong6ZR84NS2-3zqgfyjGSIku6d6KipSUbaMk9YMlCncvN0AR451EHN_QTqSd59QjOMS3MqgAQs9gaEPqFg_sSaVwDeYMckOXhfoQpSOVioUyOYwVZUQ5j0Z29Sz6gFYM7WzV6iIAD33XjFXWbGiufehHQhbSHhhLzEpa9HgSEG1b9B0YBjayeAHOtmGB6iRqIQkIv5eceMvQPJcmNs_YkzV70WaHXhB7MKym_jASjx1YO6DeHLV-0rOdaVUJwwH3vuLwzX1fQGOhVLwAMMxfE5TckzHAiLabCW6VxP_Bd46UksOoYtZrFdA74QsiAo0dRIGI6vFg2kYgMA6MfBSk5inLD16EFS2hGjza-6adDqHh6oA6uuw17GA01wdmGgRHf2CgwIoimvpyvpJ7DB3F82BANHQo-KUuo5vgcysJDVnoiFVN06IVKWVB0Qg-KlFQ74HBwQuTmw_J7GCQPLbccSsAYEiUgtOBSIltSJdCFwcW3CO3kof9Evx2rUfJPcDUsqI5UEQRWtC4jWQabC_EWLA6pEPvgJAxQYM2IeMwybUT6KE9mnd43L50pwWdJa5uq4oI7JwKtB9bS2ooeUktKngq8nKo38kSLEX5IXjCe8qdXBxMhDCsCjseThKsuB5GDQnbmsb1WHpdTtgtxg9J0AbOEPhHwfCT0K3Ugo4_Hvtc6TrF03oUGBDZ5K8Kj-DHcL6iD3TYRFhAYI3ekcLhTeSgr8RSQezTPJTIICXUEbnUHH508ntgyLaafEAnM1g4Ok80KJQta70cHaAgwkMgd4dLwfvGMLdBrvZ5SYs3BInBnp5K-n4dGQhBwCK_IYeUj1M5voVNbooEuipX8_JufsevwlnbipRdx1SJ_bCX10f2ggbU7FetVauT0y6kRZtPC7PBjVLb_t91Y80z4h633aaQflxkyyy7qld3WVYt78q7slgkvCzyNKluk2VR3t5lVS4Wt1eK52B3NVk8TNJUi13cV_B7sni6kqt0lqaz-XyWZPMsm01v04KLBV-IeTnPlnyBFkBsuFRTwkHHcFd2FSDl7drhJfj27vUld-H8SITpoJ-3vjZ2BRcd6rq5ClOvAvQ_ABjO1Xo">