<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/59937>59937</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Clang++: Bad code generation when extracting values from vector registers introduced by auto-vectorization
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          He3lixxx
      </td>
    </tr>
</table>

<pre>
    Clang++ currently generates suboptimal assembly when extracting values from a vector register if the use of that vector register was introduced by auto-vectorization. [Godbolt with the examples below](https://godbolt.org/z/1EbscxvnE).

When compiling this code with `clang++ -std=c++20 -O3 -march=rocketlake`:
```cpp
#include <cstddef>  // size_t
#include <cstdint>  // uint8_t, uint64_t
#include <memory>   // std::assume_aligned

void compare_to_bytemask(const uint8_t* __restrict data_, uint8_t compare_value, uint8_t* __restrict output_) {
    auto* __restrict input = std::assume_aligned<16>(data_);
    auto* __restrict output = std::assume_aligned<16>(output_);

    for (size_t i = 0; i < 16; ++i) {
        output[i] = input[i] == compare_value ? 0xff : 0;
    }
}

auto extract(const uint8_t* data, uint8_t compare_value) {
    alignas(16) uint8_t matches[16];
    compare_to_bytemask(data, compare_value, matches);

    return *reinterpret_cast<uint64_t*>(matches);
}
```

Clang correctly auto-vectorizes the `compare_to_bytemask` function, but extracting the lower half of the result in `extract` is done using these instructions:
```
.LCPI1_0:
        .quad   255                             # 0xff
        .quad   65280                           # 0xff00
        .quad   16711680                        # 0xff0000
        .quad   4278190080                      # 0xff000000
.LCPI1_1:
        .quad   1095216660480                   # 0xff00000000
        .quad   280375465082880                 # 0xff0000000000
        .quad   71776119061217280               # 0xff000000000000
        .quad   -72057594037927936              # 0xff00000000000000
extract(unsigned char const*, unsigned char):                         # @extract(unsigned char const*, unsigned char)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpbroadcastb    xmm1, esi
        vpcmpeqb        k1, xmm0, xmm1
        kshiftrb        k2, k1, 4
        vmovdqa64       ymm0 {k1} {z}, ymmword ptr [rip + .LCPI1_0]
        vporq   ymm0 {k2}, ymm0, ymmword ptr [rip + .LCPI1_1]
        vextracti128    xmm1, ymm0, 1
        vpor    xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        vpor    xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        vzeroupper
        ret
```

For the same code, GCC generates this assembly:
```
extract(unsigned char const*, unsigned char):
        vpbroadcastb    xmm0, esi
        vpcmpeqb        xmm0, xmm0, XMMWORD PTR [rdi]
        vmovq   rax, xmm0
        ret
```
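
For reference, the three-instruction sequence above corresponds roughly to the SSE2 intrinsics below. This is only a sketch of the expected shape of the code: `extract_sse2` is a hypothetical helper that is not part of the original reproducer, and it uses an unaligned load instead of relying on the 16-byte alignment assumption.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (x86-64 only)

// Hypothetical helper for illustration; not part of the original reproducer.
std::uint64_t extract_sse2(const std::uint8_t* data, std::uint8_t compare_value) {
    // Broadcast the comparison byte to all 16 lanes.
    const __m128i needle = _mm_set1_epi8(static_cast<char>(compare_value));
    // Unaligned 16-byte load; the reproducer asserts 16-byte alignment instead.
    const __m128i input = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data));
    // Each lane becomes 0xff where the bytes are equal and 0x00 otherwise.
    const __m128i mask = _mm_cmpeq_epi8(input, needle);
    // Move the low 64 bits of the mask into a general-purpose register.
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(mask));
}
```

Like the reproducer, this reads 16 bytes from `data` even though only the low 8 mask bytes are returned.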

Possibly relevant: when the value is extracted from a vector that the programmer defined explicitly using GCC vector extensions, the generated assembly is fine. This function:
```cpp
auto manual_vector(const uint8_t* __restrict data, uint8_t compare_value) {
    using vec_u8x16_t = __attribute((vector_size(16))) uint8_t;

    auto comparison_vector = compare_value - vec_u8x16_t{};
    auto comparison_result = *reinterpret_cast<const vec_u8x16_t*>(data) == comparison_vector;

    return reinterpret_cast<uint64_t&>(comparison_result);
}
```

is compiled as
```
manual_vector(unsigned char const*, unsigned char): # @manual_vector(unsigned char const*, unsigned char)
        vpbroadcastb    xmm0, esi
        vpcmpeqb        xmm0, xmm0, xmmword ptr [rdi]
        vmovq   rax, xmm0
        ret
```
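
When comparing the variants, a plain scalar reference may help to confirm that any changed codegen still returns the same value. The sketch below is an assumption about what such a check could look like: `extract_reference` is a hypothetical name, it only touches the low 8 input bytes, and it uses `std::memcpy` rather than `reinterpret_cast` to form the result.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar reference; not part of the original reproducer.
std::uint64_t extract_reference(const std::uint8_t* data, std::uint8_t compare_value) {
    // Build the low 8 mask bytes one at a time: 0xff on a match, 0 otherwise.
    std::uint8_t matches[8];
    for (std::size_t i = 0; i < 8; ++i) {
        matches[i] = data[i] == compare_value ? 0xff : 0;
    }
    // Copy the bytes into the result without type punning through a pointer cast.
    std::uint64_t result;
    std::memcpy(&result, matches, sizeof(result));
    return result;
}
```

The return value of `extract` or `manual_vector` for the same input can then be compared against `extract_reference`.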
</pre>