<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/59243>59243</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
LLVM/Clang miscompiles GNU C generic vector code when asked to do 512-bit operations
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
ryao
</td>
</tr>
</table>
<pre>
A while back, Intel invented a technique for doing fast fletcher4 calculations via 4x parallel accumulator streams and used it to do a fast AVX2 implementation of fletcher4:
https://www.intel.com/content/www/us/en/developer/articles/technical/fast-computation-of-fletcher-checksums.html
Later, someone extended it to use 8 independent accumulator streams:
https://github.com/openzfs/zfs/commit/70b258fc962fd40673b9a47574cb83d8438e7d94
I found that I could obtain the same assembly that Intel wrote by compiling a generic version written in GNU C:
https://gcc.godbolt.org/z/e3Kf4TcPo
Specifically, I get the following output from Clang for the loop, which is the same as what Intel wrote:
```
.LBB0_2: # =>This Inner Loop Header: Depth=1
vpmovzxdq ymm4, xmmword ptr [rsi]
vpaddq ymm0, ymm4, ymm0
vpaddq ymm1, ymm0, ymm1
vpaddq ymm2, ymm1, ymm2
vpaddq ymm3, ymm2, ymm3
add rsi, 16
cmp rsi, rdx
jb .LBB0_2
```
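For readers who cannot open the Godbolt link, here is a minimal sketch of the kind of GNU C generic vector code that can lower to a loop like the one above. It is an assumption, not the exact source behind the link: the names `v4u32`, `v4u64`, `fletcher4_ctx4` and `fletcher4_4stream` are hypothetical, and the input size is assumed to be a multiple of 16 bytes. `__builtin_convertvector()` is available in Clang and in recent GCC releases.

```c
#include "assert.h"
#include "stddef.h"
#include "stdint.h"
#include "string.h"

/* GNU C generic vector types: 4 x 32-bit input lanes, 4 x 64-bit accumulator lanes. */
typedef uint32_t v4u32 __attribute__((vector_size(16)));
typedef uint64_t v4u64 __attribute__((vector_size(32)));

/* Hypothetical context: the four Fletcher-4 running sums, one lane per stream. */
typedef struct {
    v4u64 a, b, c, d;
} fletcher4_ctx4;

/* Fold `size` bytes (assumed to be a multiple of 16) into ctx, 4 words per iteration. */
static void fletcher4_4stream(fletcher4_ctx4 *ctx, const void *buf, size_t size)
{
    const unsigned char *p = buf;
    const unsigned char *end = p + size;
    v4u64 a = ctx->a, b = ctx->b, c = ctx->c, d = ctx->d;

    assert(size % sizeof(v4u32) == 0);

    for (; p != end; p += sizeof(v4u32)) {
        v4u32 in;
        memcpy(&in, p, sizeof(in));               /* unaligned 128-bit load */
        a += __builtin_convertvector(in, v4u64);  /* zero-extend 32 -> 64 bits, like vpmovzxdq */
        b += a;
        c += b;
        d += c;
    }

    ctx->a = a;
    ctx->b = b;
    ctx->c = c;
    ctx->d = d;
}
```

Each lane i only ever sees input words i, i+4, i+8, and so on, which is what makes the four streams independent. Combining the per-lane sums into the final fletcher4 checksum is a separate step that this sketch leaves out.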
I was then curious about the performance of the 8-stream version, so I modified my GNU C code to use 8 independent accumulator streams:
https://gcc.godbolt.org/z/vhvesMhdh
That produced the following for the loop:
```
.LBB0_2: # =>This Inner Loop Header: Depth=1
vpmovzxdq ymm4, xmmword ptr [rsi]
vpaddq ymm0, ymm0, ymm4
vpaddq ymm1, ymm0, ymm1
vpaddq ymm2, ymm1, ymm2
vpaddq ymm3, ymm2, ymm3
add rsi, 32
cmp rsi, rdx
jb .LBB0_2
```
If it were not for the `add rsi, 32`, I would have been certain that I had compiled the wrong code. There should be another vpmovzxdq and 4 more vpaddq instructions to process the additional 256 bits, but those instructions are missing. There are other issues with the generated assembly too (visible at godbolt). In particular, it treats the initial loads and stores as if they operate on 512 bits.
Interestingly, GCC has the same behavior where it emits 256-bit SIMD instructions as if they were 512-bit SIMD instructions. ICC on the other hand will refuse to compile the 512-bit version, claiming that vector operations are not supported with these operand types when it sees the `__builtin_convertvector()`.
I had expected clang to just emit more AVX2 instructions to handle the additional width.
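To make that expectation concrete, here is a minimal sketch of the 8-stream computation written with two explicit 256-bit halves, so each iteration performs two widening loads and eight vector additions. It illustrates the expected semantics only. It is not the code behind the Godbolt link and is not claimed to be a fix. The names `fletcher4_ctx8` and `fletcher4_8stream_split` are hypothetical, and the input size is assumed to be a multiple of 32 bytes.

```c
#include "assert.h"
#include "stddef.h"
#include "stdint.h"
#include "string.h"

typedef uint32_t v4u32 __attribute__((vector_size(16)));
typedef uint64_t v4u64 __attribute__((vector_size(32)));

/* Hypothetical context: index 0 holds streams 0-3, index 1 holds streams 4-7. */
typedef struct {
    v4u64 a[2], b[2], c[2], d[2];
} fletcher4_ctx8;

/* Fold `size` bytes (assumed to be a multiple of 32) into ctx, 8 words per iteration. */
static void fletcher4_8stream_split(fletcher4_ctx8 *ctx, const void *buf, size_t size)
{
    const unsigned char *p = buf;
    const unsigned char *end = p + size;
    int h;

    assert(size % (2 * sizeof(v4u32)) == 0);

    for (; p != end; p += 2 * sizeof(v4u32)) {
        for (h = 0; h != 2; h++) {
            v4u32 in;
            /* Half h reads its own 16 bytes of the 32-byte block. */
            memcpy(&in, p + h * sizeof(v4u32), sizeof(in));
            ctx->a[h] += __builtin_convertvector(in, v4u64);  /* zero-extend 32 -> 64 bits */
            ctx->b[h] += ctx->a[h];
            ctx->c[h] += ctx->b[h];
            ctx->d[h] += ctx->c[h];
        }
    }
}
```

A split version like this can also serve as a reference when checking the results of the single 512-bit vector variant on the same input.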
</pre>