<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/56498>56498</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Missed optimization with movzx and mov
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Hintro
</td>
</tr>
</table>
<pre>
I came across this while examining a loop that runs slower than I expected. It involves explicit and implicit conversions between 8-bit and 32/64-bit values. Looking through the generated assembly in the Godbolt Compiler Explorer, I found lots of `movzx` instructions that neither break a dependency nor play a role in correctness. On top of that, many of them use the same register for source and destination, like `movzx eax, al`, which cannot be mov-eliminated.
I then tried some simple examples on Godbolt, and found that this behavior is persistent and easily reproducible, even when I specify `-march=skylake`. Here's an example:
```
#include <stdint.h>
int add2bytes(uint8_t* a, uint8_t* b) {
return uint8_t(*a + *b);
}
```
Clang 14 `-O3`:
```lang-text
add2bytes(unsigned char*, unsigned char*): # @add2bytes(unsigned char*, unsigned char*)
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al
ret
```
It would be better to use `movzx` in place of the initial `mov` rather than at the end. That breaks the dependency on the old RAX value from the start and clears the upper bits of RAX in the same instruction.
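To check that moving the `movzx` to the front does not change the result, here is a rough C++ model of the reordered sequence. It is only a sketch: `add2bytes_movzx_first` and `add2bytes_orders_agree` are hypothetical names made up for illustration, and the model covers register contents only, not timing or dependency behavior.

```cpp
#include <stdint.h>

// Reference: the function from the report.
int add2bytes(uint8_t* a, uint8_t* b) {
    return uint8_t(*a + *b);
}

// Hypothetical model of the reordered sequence:
//   movzx eax, byte ptr [rsi]
//   add   al,  byte ptr [rdi]
//   ret
int add2bytes_movzx_first(const uint8_t* a, const uint8_t* b) {
    uint32_t eax = *b;                    // movzx: writes all of EAX, upper bits become zero
    uint8_t al = uint8_t(eax);            // AL is the low byte of EAX
    al = uint8_t(al + *a);                // add al, ...: wraps modulo 256
    eax = (eax & 0xFFFFFF00u) | al;       // only the low 8 bits of EAX change
    return int(eax);
}

// Exhaustively compare both versions over all 256 * 256 input pairs.
bool add2bytes_orders_agree() {
    for (unsigned x = 0; x < 256; ++x) {
        for (unsigned y = 0; y < 256; ++y) {
            uint8_t a = uint8_t(x), b = uint8_t(y);
            if (add2bytes(&a, &b) != add2bytes_movzx_first(&a, &b)) return false;
        }
    }
    return true;
}
```

Since the upper bits of EAX are already zero before the byte-sized `add`, no trailing zero-extension is needed for correctness.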
Here's another example that's closer to what I was originally examining:
```
int foo(uint8_t* a, uint8_t i, uint8_t j) {
return a[a[i] | a[j]];
}
```
Clang 14 `-O3`:
```
foo(unsigned char*, unsigned char, unsigned char): # @foo(unsigned char*, unsigned char, unsigned char)
mov eax, esi
mov ecx, edx
mov cl, byte ptr [rdi + rcx]
or cl, byte ptr [rdi + rax]
movzx eax, cl
movzx eax, byte ptr [rdi + rax]
ret
```
`movzx eax, cl` here just seems unnecessary. The upper bits of RCX have already been cleared by `mov ecx, edx`, and the compiler relies on that when it uses RCX as the index in `mov cl, byte ptr [rdi + rcx]`. The subsequent `or` does not affect the upper bits, and the dependency of RCX on this `or` is not something that `movzx eax, cl` can break. So I think it's better to just do `movzx eax, byte ptr [rdi + rcx]` after the `or`. Or maybe even better, just use `mov al, byte ptr [rdi + rcx]`, since EAX should already be free and have clean upper bits after the initial `mov eax, esi`.
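The same kind of check can be sketched for this example. The model below follows the sequence with `movzx eax, cl` dropped and the final load indexed by RCX directly. The names `foo_no_extra_movzx` and `foo_sequences_agree` are hypothetical, and the model assumes, as clang's own output does, that the caller passes the `uint8_t` arguments zero-extended to 32 bits.

```cpp
#include <stdint.h>

// Reference: the function from the report.
int foo(uint8_t* a, uint8_t i, uint8_t j) {
    return a[a[i] | a[j]];
}

// Hypothetical model of the suggested sequence:
//   mov   eax, esi
//   mov   ecx, edx
//   mov   cl, byte ptr [rdi + rcx]
//   or    cl, byte ptr [rdi + rax]
//   movzx eax, byte ptr [rdi + rcx]
//   ret
int foo_no_extra_movzx(const uint8_t* a, uint8_t i, uint8_t j) {
    // Assumption: ESI and EDX arrive zero-extended from 8 bits.
    uint64_t rax = i;  // mov eax, esi: a 32-bit write zeroes bits 32..63
    uint64_t rcx = j;  // mov ecx, edx
    rcx = (rcx & ~uint64_t(0xFF)) | a[rcx];  // mov cl: replaces the low byte only
    uint8_t cl = uint8_t(uint8_t(rcx) | a[rax]);
    rcx = (rcx & ~uint64_t(0xFF)) | cl;      // or cl: upper bits of RCX stay zero
    rax = a[rcx];                            // movzx eax, byte ptr [rdi + rcx]
    return int(rax);
}

// Compare both versions for every (i, j) pair on an arbitrary 256-byte table.
bool foo_sequences_agree() {
    uint8_t table[256];
    for (unsigned k = 0; k < 256; ++k) table[k] = uint8_t(k * 37u + 11u);
    for (unsigned i = 0; i < 256; ++i) {
        for (unsigned j = 0; j < 256; ++j) {
            if (foo(table, uint8_t(i), uint8_t(j)) !=
                foo_no_extra_movzx(table, uint8_t(i), uint8_t(j))) return false;
        }
    }
    return true;
}
```

The model only confirms that the values match. Whether the shorter sequence is actually faster depends on partial-register handling in the target microarchitecture.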
I also asked about this on Stack Overflow, and Peter Cordes gave a great answer explaining how this behavior is bad for pretty much all x86 processors: https://stackoverflow.com/a/72953035/14730360
Godbolt link with examples: https://godbolt.org/z/z45xr4hq1
</pre>