<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/59937>59937</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Clang++: Bad code generation when extracting values from vector registers introduced by auto-vectorization
</td>
</tr>
<tr>
<th>Labels</th>
<td>
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
He3lixxx
</td>
</tr>
</table>
<pre>
Clang++ currently generates suboptimal assembly for extracting values out of a vector register if the use of the vector register was introduced by auto-vectorization. [Godbolt with the examples below](https://godbolt.org/z/1EbscxvnE).
When compiling this code with `clang++ -std=c++20 -O3 -march=rocketlake`:
```cpp
#include <memory>
void compare_to_bytemask(const uint8_t* __restrict data_, uint8_t compare_value, uint8_t* __restrict output_) {
auto* __restrict input = std::assume_aligned<16>(data_);
auto* __restrict output = std::assume_aligned<16>(output_);
for (size_t i = 0; i < 16; ++i) {
output[i] = input[i] == compare_value ? 0xff : 0;
}
}
auto extract(const uint8_t* data, uint8_t compare_value) {
alignas(16) uint8_t matches[16];
compare_to_bytemask(data, compare_value, matches);
return *reinterpret_cast<uint64_t*>(matches);
}
```
Clang correctly auto-vectorizes the `compare_to_bytemask` function, but extracting the lower half of the result in `extract` is done using these instructions:
```
.LCPI1_0:
.quad 255 # 0xff
.quad 65280 # 0xff00
.quad 16711680 # 0xff0000
.quad 4278190080 # 0xff000000
.LCPI1_1:
.quad 1095216660480 # 0xff00000000
.quad 280375465082880 # 0xff0000000000
.quad 71776119061217280 # 0xff000000000000
.quad -72057594037927936 # 0xff00000000000000
extract(unsigned char const*, unsigned char): # @extract(unsigned char const*, unsigned char)
vmovq xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
vpbroadcastb xmm1, esi
vpcmpeqb k1, xmm0, xmm1
kshiftrb k2, k1, 4
vmovdqa64 ymm0 {k1} {z}, ymmword ptr [rip + .LCPI1_0]
vporq ymm0 {k2}, ymm0, ymmword ptr [rip + .LCPI1_1]
vextracti128 xmm1, ymm0, 1
vpor xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 238 # xmm1 = xmm0[2,3,2,3]
vpor xmm0, xmm0, xmm1
vmovq rax, xmm0
vzeroupper
ret
```
For the same code, GCC generates this assembly:
```
extract(unsigned char const*, unsigned char):
vpbroadcastb xmm0, esi
vpcmpeqb xmm0, xmm0, XMMWORD PTR [rdi]
vmovq rax, xmm0
ret
```
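For reference, the value that `extract` is expected to return can be written as a plain scalar loop. This is only a sketch: the name `extract_reference` is hypothetical and not part of the reproducer, and it assumes a little-endian target (as on x86-64) and a 64-bit `unsigned long long`. The per-byte masks `0xff << (8 * i)` are the same eight constants that appear in Clang's `.LCPI1_0` and `.LCPI1_1` tables.
```cpp
// Hypothetical scalar reference, not part of the original reproducer.
// Assumes a little-endian target: byte i of the result is 0xff when
// data[i] == compare_value and 0x00 otherwise, for the first 8 bytes.
unsigned long long extract_reference(const unsigned char* data, unsigned char compare_value) {
    unsigned long long result = 0;
    for (int i = 0; i < 8; ++i) {
        if (data[i] == compare_value) {
            // Same constants as in Clang's tables: 0xff, 0xff00, 0xff0000, ...
            result |= 0xffULL << (8 * i);
        }
    }
    return result;
}
```
Comparing the output of `extract` against such a reference is one way to check that both compilers' code is functionally equivalent, so that the difference is purely one of code quality.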
Possibly relevant: when the value is extracted from a vector that the programmer defined explicitly using GCC vector extensions, the generated assembly is fine. This function:
```cpp
auto manual_vector(const uint8_t* __restrict data, uint8_t compare_value) {
using vec_u8x16_t = __attribute((vector_size(16))) uint8_t;
auto comparison_vector = compare_value - vec_u8x16_t{};
auto comparison_result = *reinterpret_cast<const vec_u8x16_t*>(data) == comparison_vector;
return reinterpret_cast<uint64_t&>(comparison_result);
}
```
is compiled as
```
manual_vector(unsigned char const*, unsigned char): # @manual_vector(unsigned char const*, unsigned char)
vpbroadcastb xmm0, esi
vpcmpeqb xmm0, xmm0, xmmword ptr [rdi]
vmovq rax, xmm0
ret
```
</pre>