<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/59243>59243</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            LLVM/Clang miscompiles GNU C generic vector code when asked to do 512-bit operations
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          ryao
      </td>
    </tr>
</table>

<pre>
    A while back, Intel invented a technique for doing fast fletcher4 calculations via 4x parallel accumulator streams and used it to do a fast AVX2 implementation of fletcher4:

https://www.intel.com/content/www/us/en/developer/articles/technical/fast-computation-of-fletcher-checksums.html

Later, someone extended it to perform 8 independent accumulator streams:

https://github.com/openzfs/zfs/commit/70b258fc962fd40673b9a47574cb83d8438e7d94

I found that I could obtain the same assembly that Intel wrote by compiling a generic version written in GNU C:

https://gcc.godbolt.org/z/e3Kf4TcPo

Specifically, I get the following output from Clang for the loop, which is the same as what Intel wrote:

```
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpmovzxdq       ymm4, xmmword ptr [rsi]
        vpaddq  ymm0, ymm4, ymm0
        vpaddq  ymm1, ymm0, ymm1
        vpaddq  ymm2, ymm1, ymm2
        vpaddq  ymm3, ymm2, ymm3
        add     rsi, 16
        cmp     rsi, rdx
        jb      .LBB0_2
```
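
For readers who do not want to open the godbolt link, here is a minimal sketch of what a 4-stream generic GNU C version can look like. It is an illustration, not the exact code behind the link: the type and function names (`u32x4`, `u64x4`, `fletcher4_4stream`) are hypothetical, and it relies on the compiler-predefined type macros so that it needs no headers. Each loop iteration zero-extends four 32-bit words to 64 bits and performs four dependent vector additions, which is the shape of the loop shown above.

```c
/* GNU C predefined type macros, used so this sketch needs no headers. */
typedef __UINT32_TYPE__ u32;
typedef __UINT64_TYPE__ u64;
typedef __SIZE_TYPE__ usize;

/* Hypothetical vector types: 4 x 32-bit input lanes, 4 x 64-bit accumulator lanes. */
typedef u32 u32x4 __attribute__((vector_size(16)));
typedef u64 u64x4 __attribute__((vector_size(32)));

/*
 * acc[0..3] are the a, b, c, d accumulators, one stream per lane.
 * A tail of fewer than 4 words is ignored in this sketch.
 */
static void fletcher4_4stream(u64x4 acc[4], const u32 *words, usize nwords)
{
    u64x4 a = acc[0], b = acc[1], c = acc[2], d = acc[3];

    for (usize i = 0; nwords - i >= 4; i += 4) {
        u32x4 in = { words[i], words[i + 1], words[i + 2], words[i + 3] };

        a += __builtin_convertvector(in, u64x4); /* zero-extend 4 x u32 to 4 x u64 */
        b += a;
        c += b;
        d += c;
    }

    acc[0] = a;
    acc[1] = b;
    acc[2] = c;
    acc[3] = d;
}
```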

I was then curious about the performance of the 8 stream version, so I modified my GNU C code to do 8 independent accumulator streams:

https://gcc.godbolt.org/z/vhvesMhdh

That produced the following for the loop:

```
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpmovzxdq       ymm4, xmmword ptr [rsi]
        vpaddq  ymm0, ymm0, ymm4
        vpaddq  ymm1, ymm0, ymm1
        vpaddq  ymm2, ymm1, ymm2
        vpaddq  ymm3, ymm2, ymm3
        add     rsi, 32
        cmp     rsi, rdx
        jb      .LBB0_2
```

If it were not for the `add     rsi, 32`, I would have been certain that I had compiled the wrong code. We should have done another vpmovzxdq and 4 more vpaddq instructions to process the additional 256 bits, but those instructions are missing. There are other issues with the generated assembly too (visible on godbolt). In particular, it treats the initial loads and stores as if they operate on 512 bits.
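
To make the expected behavior concrete, here is a scalar reference model of the 8-stream loop. It is only a sketch under assumed names (`STREAMS`, `fletcher4_8stream_ref`), not the code from the godbolt link. Each outer iteration consumes 8 32-bit words (32 bytes, matching the `add     rsi, 32`) and updates all 8 lanes of each of the four accumulators, which on AVX2 corresponds to two vpmovzxdq and eight vpaddq instructions per iteration.

```c
/* GNU C predefined type macros, used so this sketch needs no headers. */
typedef __UINT32_TYPE__ u32;
typedef __UINT64_TYPE__ u64;
typedef __SIZE_TYPE__ usize;

#define STREAMS 8u /* 64-bit lanes in one 512-bit accumulator */

/*
 * Scalar model: acc[k][j] is accumulator k (a, b, c, d) of stream j.
 * A tail of fewer than STREAMS words is ignored in this sketch.
 */
static void fletcher4_8stream_ref(u64 acc[4][STREAMS], const u32 *words,
                                  usize nwords)
{
    for (usize i = 0; nwords - i >= STREAMS; i += STREAMS) {
        for (unsigned j = 0; j != STREAMS; j++) {
            acc[0][j] += words[i + j]; /* a += zero-extended input */
            acc[1][j] += acc[0][j];    /* b += a */
            acc[2][j] += acc[1][j];    /* c += b */
            acc[3][j] += acc[2][j];    /* d += c */
        }
    }
}
```

Comparing the output of a vectorized build against a model like this on the same buffer is one way to check whether the upper 4 lanes are actually being processed.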

Interestingly, GCC has the same behavior, emitting 256-bit SIMD instructions as if they were 512-bit SIMD instructions. ICC, on the other hand, refuses to compile the 512-bit version, claiming that vector operations are not supported with these operand types when it sees the `__builtin_convertvector()`.

I had expected clang to just emit more AVX2 instructions to handle the additional width.
</pre>