<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/102066>102066</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Sub-optimal bool/masked vector-based operations (mainly on AVX-512)
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
zephyr111
</td>
</tr>
</table>
<pre>
Hello,
I found that the following (OpenMP) code generates sub-optimal assembly:
```lang-cpp
#include <cstdint>

float compute(const float* v, const uint8_t* m, int size) {
    float vsum = 0.0f;
    int vcount = 0;
    // Vectorize with 16 lanes; both accumulators are reductions.
    #pragma omp simd reduction(+:vsum) reduction(+:vcount) simdlen(16)
    for (int i = 0; i < size; ++i) {
        const float tmp = v[i];
        if (m[i]) {  // only masked-in elements contribute
            vsum += tmp;
            vcount += 1;
        }
    }
    return vsum / vcount;
}
```
On CPUs supporting AVX-512, this code results in the following assembly code for the main loop ([see on GodBolt](https://godbolt.org/z/P9E3bbz4Y)):
```
.LBB0_5:
vmovdqu xmm2, xmmword ptr [rsi + rdx]
vptestmb k1, xmm2, xmm2
vmovups zmm2 {k1} {z}, zmmword ptr [rdi + 4*rdx]
vaddps zmm0 {k1}, zmm0, zmm2
vpmovm2d zmm2, k1
vpsubd zmm1, zmm1, zmm2
add rdx, 16
cmp rcx, rdx
jne .LBB0_5
```
Note that unrolling has been disabled for the sake of clarity since it does not impact the problem.
The `vpmovm2d` instruction can be merged with the `vpsubd`: the latter should simply be masked for better performance. Strangely, this issue disappears (i.e. `vpaddd zmm1 {k1}, zmm1, zmm2` is generated instead) when the variable is incremented by 2 or more ([see on GodBolt](https://godbolt.org/z/GKcx63eMW)). It seems that an optimization for the special case of an increment of 1 produces less efficient code (at least for AVX-512 targets). Note that this workaround is not a viable option, since it is weird and it sometimes causes the code to be even less efficient...
There is another issue when the input variable `m` is of type `const bool*`. The code is even less efficient for no apparent reason ([see on GodBolt](https://godbolt.org/z/fccq5xYKf)):
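To make the equivalence concrete, here is a small scalar model of the 16 lanes. It is only a sketch: the names `Lanes`, `mask_to_lanes`, `count_via_sub` and `count_via_masked_add` are made up for illustration and are not LLVM or intrinsic names. It shows that expanding the mask to all-ones lanes and subtracting gives the same counters as a single masked add of 1.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;
using Lanes = std::array<int32_t, kLanes>;

// Models vpmovm2d: a lane becomes -1 (all ones) if its mask bit is set, else 0.
Lanes mask_to_lanes(uint16_t k) {
    Lanes out{};
    for (int i = 0; i < kLanes; ++i) {
        out[i] = ((k >> i) & 1) ? -1 : 0;
    }
    return out;
}

// Models the currently generated pair: vpmovm2d followed by an unmasked vpsubd.
// Subtracting -1 in the active lanes increments them by 1.
Lanes count_via_sub(Lanes acc, uint16_t k) {
    const Lanes expanded = mask_to_lanes(k);
    for (int i = 0; i < kLanes; ++i) {
        acc[i] -= expanded[i];
    }
    return acc;
}

// Models the single instruction one would hope for: a masked vpaddd of ones.
// Inactive lanes keep their previous value.
Lanes count_via_masked_add(Lanes acc, uint16_t k) {
    for (int i = 0; i < kLanes; ++i) {
        if ((k >> i) & 1) {
            acc[i] += 1;
        }
    }
    return acc;
}
```

Both functions return the same lanes for any mask, which is why the mask expansion looks redundant here.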
```
.LBB0_5:
vmovdqu xmm3, xmmword ptr [rsi + rdx]
vpsllw xmm4, xmm3, 7
vpmovb2m k1, xmm4
vmovups zmm4 {k1} {z}, zmmword ptr [rdi + 4*rdx]
vaddps zmm0 {k1}, zmm0, zmm4
vpand xmm3, xmm3, xmm2
vpmovzxbd zmm3, xmm3
vpaddd zmm1, zmm1, zmm3
add rdx, 16
cmp rcx, rdx
jne .LBB0_5
```
We can see that the `vpmovzxbd` can be merged with the `vpaddd` (thanks to masking), as in the previous case. Additionally, the `vpand` and `vpsllw` instructions are generated because of the `bool` type, while AFAIK there should be no more operations to perform than for a `uint8_t` type. This is a frequent issue with the `bool` type in SIMD code (not just with AVX-512 but also with AVX-1/AVX-2). Here again, setting the increment to 2 strangely helps on AVX-512 targets (it removes the `vpmovzxbd` and `vpand` instructions but not `vpsllw`), but not really on AVX-2 targets.
Good generated assembly code (with unrolling disabled) would be:
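As a rough per-byte model of the two mask computations (a sketch only; the function names are hypothetical, and it assumes a valid `bool` is stored as the byte 0 or 1, as is usual on x86-64): `vpsllw 7` followed by `vpmovb2m` effectively selects bit 0 of each byte, while `vptestmb` tests whether the byte is non-zero. For bytes restricted to 0 and 1 the two agree, so the `uint8_t` sequence would be enough.

```cpp
#include <cassert>
#include <cstdint>

// Models vpsllw by 7 + vpmovb2m for one byte: bit 0 is moved to bit 7,
// and vpmovb2m then reads the most significant bit of the byte.
bool lane_active_bool_path(uint8_t byte) {
    const uint8_t shifted = static_cast<uint8_t>(byte << 7);
    return (shifted & 0x80u) != 0;
}

// Models vptestmb of a byte with itself: the lane is active if the byte is non-zero.
bool lane_active_uint8_path(uint8_t byte) {
    return byte != 0;
}
```

The two paths only differ for bytes that are not valid `bool` values (for example 2), which this loop should never see.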
```
.LBB1_3:
vmovdqu xmm0, xmmword ptr [rsi + rcx]
vptestmb k1, xmm0, xmm0
vaddps zmm2 {k1}, zmm2, zmmword ptr [rdi + 4*rcx]
vpaddd zmm4 {k1}, zmm4, zmm10
add rcx, 16
cmp rcx, rax
jb .LBB1_3
```
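For reference, the loop above roughly corresponds to the following hand-written intrinsics version. This is a sketch under assumptions, not what LLVM is expected to emit literally: the function names `masked_mean_scalar`, `masked_mean_avx512` and `masked_mean` are made up, it assumes GCC or Clang on x86-64 (for the `target` attribute and `__builtin_cpu_supports`), and it falls back to the scalar loop when AVX-512F/BW/VL are not available. Whether the compiler folds the load into `vaddps` as in the listing is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX512_SKETCH 1
#endif

// Portable reference with the same semantics as the OpenMP loop of this report.
float masked_mean_scalar(const float* v, const uint8_t* m, int size) {
    float vsum = 0.0f;
    int vcount = 0;
    for (int i = 0; i < size; ++i) {
        if (m[i]) {
            vsum += v[i];
            vcount += 1;
        }
    }
    return vsum / vcount;
}

#ifdef HAVE_AVX512_SKETCH
__attribute__((target("avx512f,avx512bw,avx512vl")))
float masked_mean_avx512(const float* v, const uint8_t* m, int size) {
    __m512 vsum = _mm512_setzero_ps();
    __m512i vcount = _mm512_setzero_si512();
    const __m512i ones = _mm512_set1_epi32(1);
    int i = 0;
    for (; i + 16 <= size; i += 16) {
        // Load 16 mask bytes and turn "byte != 0" into a 16-bit mask (vptestmb).
        const __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(m + i));
        const __mmask16 k = _mm_test_epi8_mask(bytes, bytes);
        // Masked load and masked adds: inactive lanes keep their old value.
        const __m512 vals = _mm512_maskz_loadu_ps(k, v + i);
        vsum = _mm512_mask_add_ps(vsum, k, vsum, vals);
        vcount = _mm512_mask_add_epi32(vcount, k, vcount, ones);
    }
    // Horizontal reductions, then a scalar loop for the remaining elements.
    float sum = _mm512_reduce_add_ps(vsum);
    int count = _mm512_reduce_add_epi32(vcount);
    for (; i < size; ++i) {
        if (m[i]) {
            sum += v[i];
            count += 1;
        }
    }
    return sum / count;
}
#endif

// Uses the AVX-512 path only when the running CPU supports it.
float masked_mean(const float* v, const uint8_t* m, int size) {
    assert(size >= 0);
#ifdef HAVE_AVX512_SKETCH
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw") &&
        __builtin_cpu_supports("avx512vl")) {
        return masked_mean_avx512(v, m, size);
    }
#endif
    return masked_mean_scalar(v, m, size);
}
```

Note that the vector path sums in a different order than the scalar loop, so results can differ slightly for general floating-point inputs, exactly as with the OpenMP reduction.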
Possibly related issues: [this](https://github.com/llvm/llvm-project/issues/55231) and [this](https://github.com/llvm/llvm-project/issues/46917).
This problem is apparently not related to Clang or OpenMP, but to the LLVM optimizer, since [the same issue has been seen on the LLVM-based ISPC project](https://github.com/ispc/ispc/issues/2920), which generates LLVM IR (the assembly code generated there tends to be even less efficient than that of Clang+OpenMP, though I am not sure exactly why).
Fixing the 2 main issues would be great since it should speed up both OpenMP and ISPC codes (with conditional statements), as well as possibly auto-vectorized native codes!
</pre>