<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/62948">62948</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Emitting byte operations is extremely expensive on Zen 3
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          ryao
      </td>
    </tr>
</table>

<pre>
    Lately, I have been trying to micro-optimize a binary search function that operates on 4KB sized arrays. Doing some experiments with loop unrolling yielded three alternative implementations.

https://gcc.godbolt.org/z/79rq7sqcf

They all should perform similarly, yet on my Ryzen 7 5800X, they do not:

```
Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077567, density: 10

Even distribution with 1024 32 bit integers, random access

| Name                                  | Items | Hits | Misses | Time     |
| ------------------------------------- | ----- | ---- | ------ | -------- |
| custom_binary_search_unroll_switch_v1 |  1000 |  980 |   9020 | 0.000220 |
| custom_binary_search_unroll_switch_v2 |  1000 |  980 |   9020 | 0.000564 |
| custom_binary_search_unroll_switch_v3 |  1000 |  980 |   9020 | 0.000567 |
```

I applied the following patch to the assembly code, then assembled and linked the result:

```
diff --git a/out.s b/out.s
index f42f44c..24d6940 100644
--- a/out.s
+++ b/out.s
@@ -346,8 +346,8 @@ custom_binary_search_unroll_switch_v2:  # @custom_binary_search_unroll_switch_v2
        bsrl    %eax, %r8d
        movl    %r8d, %eax
        xorl    $31, %eax
-       movb    $30, %cl
-       subb    %al, %cl
+       movl    $30, %ecx
+       subl    %eax, %ecx
        movl    $1, %eax
        shll    %cl, %eax
        leal    -1(%rax), %ecx
@@ -481,8 +481,8 @@ custom_binary_search_unroll_switch_v3:  # @custom_binary_search_unroll_switch_v3
        bsrl    %eax, %eax
        movl    %eax, %r8d
        xorl    $31, %r8d
-       movb    $30, %cl
-       subb    %r8b, %cl
+       movl    $30, %ecx
+       subl    %r8d, %ecx
        movl    $1, %r8d
        shll    %cl, %r8d
        movl    %esi, %r9d
```

Now, when I rerun the micro-benchmark, all 3 perform similarly:

```
Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077784, density: 10

Even distribution with 1024 32 bit integers, random access

| Name                                  | Items | Hits | Misses | Time     |
| ------------------------------------- | ----- | ---- | ------ | -------- |
| custom_binary_search_unroll_switch_v1 |  1000 |  945 |   9055 | 0.000224 |
| custom_binary_search_unroll_switch_v2 |  1000 |  945 |   9055 | 0.000206 |
| custom_binary_search_unroll_switch_v3 |  1000 |  945 |   9055 | 0.000202 |
```

If I compile with GCC 12.2, I see an even better execution time on the last one:

```
Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077837, density: 10

Even distribution with 1024 32 bit integers, random access

| Name                                  | Items | Hits | Misses | Time     |
| ------------------------------------- | ----- | ---- | ------ | -------- |
| custom_binary_search_unroll_switch_v1 |  1000 |  972 |   9028 | 0.000320 |
| custom_binary_search_unroll_switch_v2 |  1000 |  972 |   9028 | 0.000340 |
| custom_binary_search_unroll_switch_v3 |  1000 |  972 |   9028 | 0.000178 |
```

That is unsurprising, considering that GCC emits fewer instructions for the switch statement body. Here is the first case statement from GCC:

```
.L54:
        leal    512(%rdx), %ecx
        cmpl    %esi, (%rdi,%rcx,4)
        cmovle  %ecx, %edx
```

And the first case statement from LLVM:

```
        leal    512(%rcx), %eax
        cmpl    %edx, (%rdi,%rax,4)
        cmovgl  %ecx, %eax
        movl    %eax, %ecx
```

In any case, `movb`, `subb`, and presumably other byte operations are disproportionately expensive on Zen 3, and passing `-mtune=znver3` does not stop LLVM from emitting them. Whatever optimization pass is opportunistically narrowing operations to byte width should be made to stop doing that on amd64: there is no way it improves performance there, and the measurements above show it is actively detrimental on a popular amd64 processor.
</pre>