<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/137422">137422</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[X86] Suboptimal code for AVX-512 narrowing
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
dzaima
</td>
</tr>
</table>
<pre>
This code, compiled via `-O3 -march=znver4`:
```c
#include <immintrin.h>
#include <stdint.h>
void narrow_u32x16x4_to_u8x64(uint8_t* dst, __m512i x0, __m512i x1, __m512i x2, __m512i x3) {
    __m512i inds = _mm512_set_epi8(
        124, 120, 116, 112, 108, 104, 100, 96,
        92, 88, 84, 80, 76, 72, 68, 64,
        60, 56, 52, 48, 44, 40, 36, 32,
        28, 24, 20, 16, 12, 8, 4, 0,
        124, 120, 116, 112, 108, 104, 100, 96,
        92, 88, 84, 80, 76, 72, 68, 64,
        60, 56, 52, 48, 44, 40, 36, 32,
        28, 24, 20, 16, 12, 8, 4, 0
    );
    __m512i x01 = _mm512_permutex2var_epi8(x0, inds, x1);
    __m512i x23 = _mm512_permutex2var_epi8(x2, inds, x3);
    __m512i x0123 = _mm512_mask_blend_epi64(0xF0, x01, x23);
    _mm512_storeu_si512(dst, x0123);
}
```
produces:
```asm
narrow_u32x16x4_to_u8x64:
        vmovdqa64 zmm4, zmmword ptr [rip + .LCPI0_0]
        vmovdqa64 zmm5, zmmword ptr [rip + .LCPI0_1]
        vpshufb zmm1, zmm1, zmm4
        vpshufb zmm0, zmm0, zmm5
        vpshufb zmm3, zmm3, zmm4
        vpshufb zmm2, zmm2, zmm5
        vporq zmm0, zmm0, zmm1
        vporq zmm1, zmm2, zmm3
        vpmovsxbd zmm3, xmmword ptr [rip + .LCPI0_3]
        vpermi2d zmm3, zmm0, zmm1
        vmovdqu64 zmmword ptr [rdi], zmm3
        vzeroupper
        ret
```
instead of the more direct version that gcc produces:
```asm
narrow_u32x16x4_to_u8x64:
        vmovdqa64 zmm4, ZMMWORD PTR .LC0[rip]
        kmovb k1, BYTE PTR .LC1[rip]
        vpermt2b zmm0, zmm4, zmm1
        vpermi2b zmm4, zmm2, zmm3
        vmovdqa64 zmm0{k1}, zmm4
        vmovdqu64 ZMMWORD PTR [rdi], zmm0
        ret
```
The code implements a general 64-element `u32` to `u8` narrow. On both Intel and AMD it should have 2x the throughput of the `vpmovdb`-based code that clang currently generates via autovectorization (and it allows merging multiple results with a blend instead of an insert, which can run on more ports), so that is perhaps a separate thing that could be improved. I believe similar approaches should give a ~2x throughput boost for all narrowing conversions, on both Intel and AMD.
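For reference, this is the kind of scalar narrowing loop (a hypothetical sketch, not part of the reproducer; the function name is made up) that clang currently autovectorizes into a `vpmovdb`-based sequence:
```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical scalar reference: a plain truncating u32 -> u8 copy.
// Loops of this shape get autovectorized with vpmovdb today; the
// permute-based intrinsic version above should beat that by ~2x in throughput.
void narrow_u32_to_u8_scalar(uint8_t* dst, const uint32_t* src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)src[i]; // keep only the low byte of each element
}
```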
https://godbolt.org/z/ax7Yda7Ps
</pre>
<img width="1" height="1" alt="" src="http://email.email.llvm.org/o/eJy8Vk2P2zYQ_TX0ZbAGRX364IN3HQMBGiRIF23Si0GJtMVGFBWS0io-9LcXJOVd2eskKNDWMDAW-fiG8-ZRJjVGHFvO1yi9R-l2QXtbK71mJyokXZSKfVs_1sJApRhH5AEqJTvRcAaDoIAyfPc-hjtJdVWjeHtqB64TlGEUbxD23wyHb-UeSCzaqukZBxQ_CClFa7VolzWK37yaNpaJ1j7P4c2gBIOWaq2e9n1Mxigbk71V-74YswSRohetLfYWkQ0wY91e93uZRkTAiC-eoosncvEUI7IClN8jvAGA53HRMgMo3sJeuoG94XbPO1EgUkxI94lI4sgi4vNFURaCzxDhIoQAwR6ycogZAaw8FooihCQEj4U8CyFAMg9xhXuCLGDSgEkDJgk0SaBJAiT2kJhcZSYBGyqAUAGECiBUEAAQAPhy-b9V-T8v_Pz5HwSAawnCakRWKL42zIijuV86rmVv-UgGqs_GCa50znLRufIWD4l_ykMueOLv7eeSSVLzZV82vGWOxx8gPO78lkbsT8hIrqgm61uleb83InWqFNNR8_zPcJRv52cf4U2nFesrbqYXw3mGGonw5ruH2oNhkGpgX2mWTJ06Sek7cJLySWkGndWA0nstOkDkHpa_PHx4i_cYpdtZf19ITlKmP18eXS3vTN0fSrcomhafY-L3-DKPp_FzTG_TxNN0PKd5DSPTNLnJpvTXIMh10ug2LLpiiy9gUg1mLNmLzH5v4w91iiedBmdNQc6L55Xd3JNvRz_v6UUSJhzvjU2euFZ913E9G9TcXtlNtMZyykAdwNYcpNIcmNC8sjBwbYRqwdbUwrGq4L-w5h_v3v3-_uMWPjx-dELhoNulpb5INZQu-qbcf358c4ZHt-BeYEvKmcD42Tq3mu77Uc7nb3b9dQUY5fdfIneGbzjzuW_zEq87hn_YHIQ3jzX3NwoQsmu45K01QOHIW65pA1lyx8Oou2H0MUEZBqv8Q-F-h664ZLRlYGrVNwxqOnAgI9TiWHMNttaqP9Zdb12rW-iNaI-OwvuclY6HGqga2h6h6rXmrW2-AVPc-LsN7a0aeGWVFidqnWNUC6WyNbxtLW985s27LSBSuJ-0adSTAaZcFsn10UV1ANk3VnQNB81N39iJG_ybF2Y2Fa3h2r9Kn2pR1VDRFnTvk3r3dkpb416w5AGM8u5FJDfQcV3TzqlneEc1tRxs7XJ7f1demdLrrNXA2RLeQskbwQcORkjRUA2067SiVc3NWcojt0DhLzLOVSyVMhYOSrtapxa4RJVqpzPl_4BuqrQMba-t7fw5IztEdkfFStXYpdJHRHYnRHZ0zD8zmn8wC7aO2Spe0QVfR3mSZiTDcbKo14wRXB7SOF-tktWBJFGR4wKTmLGEccKihVgTTFKckAxjnEfJMq-qAvM8rg5sVSRpgRLMJRXNsmkG6XIvhDE9X0dxnhCyaGjJG-Nvw4S0_An8LCLEXY712i26K_ujQQluhLHmhcYK2_hr9KciQ-kWfu1L1VkhaROs7pTb_PbpLo3Ii3qLXjfrK1mErftyWSmJyM6xT-Gu0-pPXllEdn5PBpHdtOlhTf4OAAD__3CkVLQ">